INTRODUCTION


THIS REPORT is organized for the most part in accordance with the plan
used in the previous issue of the Review of Educational Research for Feb- 
ruary, 1933, which was devoted to educational tests and their uses. One 
important change has been made, however. The chapter on basic considerations
was replaced by a consideration of tests and measurements in
several foreign countries. A precedent for this policy exists in the recent 
issue of the Review of Educational Research devoted to psychological tests. 
Such reviews are also quite common in foreign periodicals. 

Acting upon the suggestions of many of our readers we have increased 
the amount of critical comment concerning the references quoted. 

While there is still some overlapping among the chapters, we think we 
have succeeded in restricting this feature to a minimum. We trust that 
we have succeeded at the same time in covering the entire field in an ac- 
ceptable manner. 

The chairman of the Committee wishes to take this opportunity to thank 
the other members for their faithful cooperation. Thanks are also due to 
Messrs. Maucker of Iowa, Frutchey of Ohio, and Mrs. Louise Mahone of 
Washington for their valuable contributions to the success of this number. 


W. J. Osburn, Chairman,
Committee on Educational Tests and Their Uses. 





CHAPTER I 


Educational Tests and Measurements in China, 
England, France, and Germany¹


The report that follows is reasonably complete for the countries represented.
Occasionally a report is also included from Holland and Poland.


A. CHINA 


Tests and Measurements 


In China educational experimentation is new. The Chinese have but 
recently thought of education as an area of specialized and professionalized 
learning. Furthermore, for a long time the traditional examination system 
has seemed sufficient without any changes of consequence in its procedure 
or requirements. Recent contact with other countries, however, has chal- 
lenged the leadership of Chinese social, economic, cultural, and educational 
life to such an extent that some significant changes in the procedures of 
educational tests and measurements have been made. This movement began 
in 1922 when William A. McCall of Columbia University visited China 
as director of psychological research under the auspices of the Chinese 
National Association for the Advancement of Education. As early as 1913, 
and later in 1921, informal discussions on testing were conducted by such 
men as Liao Shih Chen, Chen Ho Ching, and Chang Yao Hsian; and in 
such institutions as the Higher Normal Schools of Peking and Nanking. 

By 1923, many tests had been standardized, covering such fields as in- 
telligence, silent reading, auditory comprehension, mixed Wenli and 
National Language, composition, formal handwriting, running hand- 
writing, fundamentals of arithmetic, arithmetical problems, fundamentals 
of algebra, algebraic equations, algebraic problems, integers, decimals, 
general science, Chinese geography, Chinese history, household arts, citi- 
zenship, health, and the teaching of English in Chinese schools. Many 
of these tests were similar to those used in America. To quote McCall, 
“The tests . . . merit, I believe, the conclusion that in every case they 
are equal, and in most cases they are superior to the like tests in America.” 

For a few years both intelligence and achievement testing developed 
in almost every phase of school work, yielding such results as mental and 
subject ages, correlations of abilities, objective technics of testing, and 
comparisons by subjects among individuals, classes, and other groups. 
In the period just previous to 1925, the emphasis on intelligence, language,
and arithmetic tests was most noticeable in the educational periodicals.
There were also discussions of principles and technics of testing and
measuring.

¹ Professor Weidemann reports for China, Mrs. Mahone for France, and Professor Osburn for England
and Germany.

Between 1922 and 1924, testing in the province of Kiangsi was undertaken.
All of the public and private schools gave intelligence and achievement
tests. Teachers became aware of pupil differences, special classes for
exceptional children were proposed, and a new professional interest among 
teachers seemed to have been created. From 1925 to 1927, interest in the 
movement decreased, but from 1928 to 1930 a revival of interest was 
evident. Since 1930, in such municipal centers as Nanking and Shanghai,
the measurement of the intelligence and achievement of students has in- 
creased. In some experimental schools established by bureaus of education 
in cities or districts or affiliated with teacher-training institutions, the out- 
comes of instruction are regularly tested and measured. 


Educational Research Organizations 


Many institutions engage in some kind of educational research. In the 
normal schools, colleges of education, and departments of education in 
the universities, the teachers carry on research and collaborate with ad- 
vanced students. In the College of Education of Central National Univer- 
sity at Nanking, Professor Hi Wei has published a series of reports on 
research findings which exemplify educational research in such institu- 
tions. 

The best known organizations covering the general phases of educa- 
tional developments in China are: the Chinese National Association for 
the Advancement of Education, the National Association for Mass Edu- 
cation, the National Vocational Education Association, and the Chinese 
National Education Association. Each of these bodies has published ma- 
terials, chiefly of a statistical and survey character, relative to many 
different phases of Chinese education. 

Some educational research has been done in such provinces as Kiangsi 
and Shantung, and in such municipalities as Nanking and Shanghai. At 
Pin-Yang and Chow Ping some educational social experiments have been 
conducted. 

Since 1928, the Institute of Educational Research at National Sun Yat 
Sen University in Canton has been trying to devote its entire program to 
educational research of a scientific character and to assemble a collection 
of literature essential to such research. The productions of this institute 
include studies for the improvement of teaching Chinese, the discovery 
of better methods and contents for adult education, and the relationship 
between education and the economic life of the Chinese people. This 
institute also publishes the Chinese Journal of Educational Research and 
a series of research monographs. Its assembled references include an 
eight-volume Index to Chinese Educational Periodicals and 34,000 volumes
in Chinese and other languages consisting of books, magazines, reference 
books, documents, research bulletins, and school textbooks (12). 


Sample References to Recent Educational Research


Through the representative of the Chinese Ministry of Education, Mr. 
Wei Hsueh-Chih, a number of recent documents of an experimental nature 
have been received (see bibliography). Mr. Chih’s statement indicates 
that through Dr. H. H. Hsiao of the Society of Tests and Measurements 
at Nanking, an authority on educational research in China, much in- 
formation may be obtained. The documents received are in Chinese. Many 
efforts were made to secure their translation, but no person was available 
who judged himself capable of translating the technical vocabulary. The 
references indicate something of the nature of the studies, but no evaluation 
of their content was possible. 


Summary 


The above description indicates that Chinese educational leaders are 
developing tests, measurements, and experimentation for Chinese boys 
and girls. They have established a basis for such development, have suc- 
ceeded to the point where whole provinces have used educational tests, 
have started to develop their own institutions adapted to solve Chinese 
problems, and have begun a movement toward education as a profession. 
Chinese and American educators should exchange educational research 
literature in order to further international goodwill and humanitarianism. 


B. ENGLAND, FRANCE, AND GERMANY 


Educational tests and measurements have developed much further in 
the United States than they have in Europe. From the European point 
of view our growth has been too extensive. European thought tends to be 
organismic and configurational in character. There is little sympathy for 
our atomism in psychology or pragmatism in philosophy. Educational 
tests do not thrive where elements or “atoms” are neglected. Pragmatism, 
with its insistence that anything is good if it works, encourages the growth 
of educational tests as a means of ascertaining to what extent our desired 
outcomes have been achieved. In Europe there is little or no emphasis 
upon educational outcomes as we know them. 

Whatever the complete explanation may be, the fact remains that 
European countries have not shown a marked interest in tests for general 
use in schools. No large publishing concerns in Europe devote a major 
portion of their attention to the publication of educational tests and meas- 
urements. On the other hand it would be erroneous to state that Europe 
is not interested in educational tests. Numerous workers in England, France, 
and Germany are using them for research purposes. 

Everywhere during recent years a remarkable interest in intelligence 
testing has developed. References to intelligence testing were collected 
through error in gathering the data which relate to France. Rather than 
lose the results of the work the references are included. All other articles 
relating exclusively to psychological tests were excluded. In many cases, 
however, foreign authors make no clear distinction between psychological
and educational tests. This will account for several citations that are 
apparently psychological in nature but which nevertheless devote con- 
siderable attention to tests and statistical procedures of an educational 
type. For this reason a few references are included which occur also in 
the June, 1935, issue of the Review of Educational Research relating to 
psychological tests. 


Criticism of Testing 


Since the testing movement is questioned in Europe, one would expect 
to find many articles of a critical nature. There are a few. Juhasi (87) dis- 
cussed the effect of the organismic movement on testing. Szondi (185) re- 
ported a critical and historical study of psychological tests which is 
applicable to educational tests as well. Thomson (191) would replace 
intelligence quotients with standard scores. 

Another sort of criticism is directed at the American testing movement 
as such. Duthil (48) likes what we are doing. Leitzmann (106) also of- 
fered favorable comment. He wonders “why we [in Germany] have so 
long overlooked the achievement tests and the rich literature that goes 
with them.” He then offers these reasons. In general, Germany has paid 
little attention to investigations in other lands. The great war and the 
years of inflation following it have interfered. There is an attitude of 
“What can America, the land of practicality and pragmatism tell to us, 
the very authors of pedagogy?” The author quotes William Stern to the 
effect that our tests are too mechanical and that they are the product of 
a “test-cult.” Leitzmann then proceeds to quote several of Counts’ adverse 
criticisms of tests but offers nothing in rebuttal. He sees a contradiction 
in our claim that every child is unique and test construction based on 
central tendencies. In spite of all adverse criticism, the author concluded 
that Germany can gain from educational tests “much valuable help in 
formulating more objective and useful judgments.” 


Statistical Procedure 


There is much interest in statistical procedures related to testing. 
Blumenfield (20) gave an interesting mathematical treatment of test 
evaluation. Bobertag (22) has a rather elementary discussion of statistics 
for his readers in Germany. Dwelshauvers (57), Meili (115), and Spear- 
man (178, 179) presented discussions concerning the G factor. Emmett 
(64) and Wilson (202, 203) were concerned with the tetrad criterion for 
use in scholastic examinations. The point at issue was the G factor. Fessard 
(71), Levitof (103), and Valentine and Emmett (197) were interested 
in reliability. Statistical phases of test standardization were discussed by 
Fessard and Piéron (68), and Mme. H. Piéron (147, 148). Kern and 
Lindow (89) and Lipmann (108) reported discussions of curves and 


446 























curve fitting. The applications of the theory of probability were presented 

and criticized by Moers (120), Myers (127), Poppelreuter (151), and 

Thouless (193). Pauli and Wenzl (134) presented a “very simple method 

for the calculation of coefficients of correlation.” They suggested the 
M,—M., 

formula R = MM in which the M’s represent the means of the upper 


and lower halves respectively of each distribution. Rosenblum (161) 
developed a method of testing homogeneity among test items. Thumb 
(194) presented clearly the theory and technic of multiple correlation 
applied to the weighting of test items. He suggested the use of polar co- 
ordinates which he illustrated with interesting geometrical figures. Urban 
(195, 196) wrote a series of articles on methods. 
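
The half-means calculation credited to Pauli and Wenzl can be sketched in Python. The review gives only the outline of the formula, so the reading used here is an assumption: the numerator is the difference between the mean y-scores of the cases falling in the upper and lower halves when ranked on x, and the denominator is the difference between the mean y-scores of the upper and lower halves when ranked on y itself. The function name `half_means_r` is hypothetical.

```python
from statistics import mean


def half_means_r(xs, ys):
    """Estimate a correlation from the means of upper and lower halves.

    Assumed reading of the formula (not confirmed by the review):
      numerator   = mean(y | upper half on x) - mean(y | lower half on x)
      denominator = mean(y | upper half on y) - mean(y | lower half on y)
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    half = n // 2  # with an odd n the middle case is left out of both halves
    if half == 0:
        raise ValueError("need at least two paired scores")

    # Rank the pairs on x, then take the y-scores of each half.
    by_x = sorted(zip(xs, ys), key=lambda pair: pair[0])
    y_lower_on_x = [y for _, y in by_x[:half]]
    y_upper_on_x = [y for _, y in by_x[n - half:]]

    # Rank y on itself to get the largest possible difference of half-means.
    sorted_y = sorted(ys)
    denominator = mean(sorted_y[n - half:]) - mean(sorted_y[:half])
    if denominator == 0:
        raise ValueError("y-scores do not vary between halves")

    return (mean(y_upper_on_x) - mean(y_lower_on_x)) / denominator


if __name__ == "__main__":
    print(half_means_r([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.5
```

Under this reading, two series in identical order give 1.0 and two series in exactly reversed order give -1.0. For roughly normal scores the ratio approximates the ordinary correlation coefficient, which would explain why the authors call the method simple.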


Test Batteries 


Ten articles were concerned with test batteries. Ballard (17) is clearly 
the most influential author of educational tests in the countries covered 
by this report. His tests are available for commercial use and are command- 
ing attention in both France and Germany. Decroly (37, 38, 39, 40) used 
them in Belgium and they have been used in Germany. Duthil (51, 55) 
published a manual for the use of intelligence, achievement, and aptitude 
tests. Sandon (167) reviewed a secondary-school test prepared by Brock- 
ington. Rather elementary discussions of testing methods were presented 
by Duthil (51, 54, 56), Frickx (77), Nihard (130), K. Stern (182), and 
W. Stern (183). 


Tests as Survey Instruments 


A number of articles were concerned with the use of tests for survey 
purposes. Bartsch (18) compared the achievement of pupils in two types 
of German schools. His tests were designed to measure concentration of 
attention, interpretation of familiar pictures, the combination of mean- 
ingless figures, the power of imagination with reference to given pictures, 
the power of observation, and forgetfulness. Bartsch used the profile 
technic but failed to describe his tests. Bobertag (23) compared estimates 
and measurements of achievement in the German folk schools. He showed 
that measurements are much more reliable. Bobertag is one of the few 
German investigators who has the American point of view concerning 
tests. A series of articles covered a study of achievement in certain schools 
of college rank in Germany (83). The authors were interested in the 
stage of knowledge attained by pupils of different intellectual and social 
levels. They agreed that “testing procedure is looked upon as an American 
effort toward rationalization and tends to become the single ideal toward 
which the human soul strives,” but they could not forget the “very critical 
condition of the ‘Hoch-Schule’ studies and their immense defects.” The 
authors expected their studies to be criticized as “grab and snatch” procedures.
They defended them not as individually diagnostic but as a “statistical
method of searching after knowledge.” The experiment involved
1,143 subjects and many assistants. The tests covered such items as power 
of observation and description, the understanding of textbook content, and 
the command of a foreign language to be chosen by the subject. The tests 
of observation and description contained pictures of skulls half human 
and half chimpanzee or ape. The criterion was a description written by an 
expert describer of skulls. The test conditions were carefully controlled 
in all cases. 

The authors were much interested in sex differences. When the girls 
did not do as well as the boys the cause was ascribed to “the greater 
sensitivity of the woman.” In some cases the girls excelled the boys. Con- 
cerning this result the authors say: “An explanation of this fact is naturally 
difficult to give. One might point to the well-known psychological fact 
that the female sex is inclined to be, for the most part, more frivolous, 
more eager for praise, more diligent and quite devoted to details and 
minutiae.” When the girls excel, the authors simply cannot understand it
or reconcile themselves to it. 

In both the English and French examinations the content was a dis- 
cussion of M. Briand’s reply to Mr. Kellogg concerning the outlawry of 
war. Reference is made to the isolationist scruples of America. A portion 
of the text reads as follows: “Whatever happens and whatever the other 
struggles in which France may find herself involved, she can rely, at worst, 
on American neutrality.” Great care was taken throughout the experiment 
and much attention was devoted to errors. The articles have much to say 
concerning the difficulties encountered in evaluating answers. 

Huth (84) used modified forms of tests prepared by Bobertag and 
Hylla with different time limits. He was interested in attention, applica- 
tion to details and to relations, association, command of language, dis- 
crimination of space relations, concepts of order, ability to make analogies, 
and ability in mathematical and organized thinking. Attention was also 
devoted to such items as interest in work, understanding of work, care- 
fulness, and sobriety. 

A test battery published by Bobertag consists of six parts, two of which 
are devoted to reading, two to arithmetic, one to language, and one to 
spelling (101). The tests may be purchased at one mark per copy from 
the Zentralinstitut für Erziehung und Unterricht, Berlin W 35 Potsdamer
Strasze 120. This is the only instance noted of tests published in Germany 
for commercial use. 

Müller (124) reported a survey of the relation between school performance
and social influences. School performance was measured in
terms of teachers’ estimates. Weigl (199) listed 790 pupils with reference 
to logical thinking, spatial discrimination, and power of attention. The 
occasion for the tests was the passage of the pupils from a lower to a
higher school. 


Tests of School Subjects


Arithmetic—Tests were reported for eight individual school subjects. 
Arithmetic leads with eight articles. Burt (30) and Duthil (50) reported 
tests that are available commercially. Korn (91) presented a study of 
performance and error in calculation. Lindbeck (107) was interested in 
the work curve of school children. The task was to complete sums in which 
the digit in the ten’s place had been omitted. Rate was measured by having 
the children mark where they were at the end of each time interval. Work
curves of individual pupils were shown along with the curve for the entire 
class. Deviations were interpreted in terms of temperament, intelligence, 
emotionality, and interest. Experimentation of this sort would be helpful 
in America. 

Révész (157) reported a study of arithmetical achievement and skill 
in the highest levels of the primary school (sixth grade). The study was 
carried on in 19 schools of the city of Amsterdam. The tests were a 
modification of the Ballard tests. They comprised in part 100 abstract 
examples and 100 problems. The pupils’ papers were examined for errors 
and the types of errors were reported. In problem-solving, hoaxes were 
introduced such as the following: 


1. A boy has to walk 3 kilometers to get to school. He can ride four times as fast
on his bicycle as he can walk. How far must he ride on his bicycle to reach school?

2. A man can go from his home to the depot in 20 minutes. His son likewise can
make the same journey in 20 minutes. How many minutes will it take them if they
go together?

3. If it takes 3 minutes to boil an egg how many minutes will it take if there are
ten eggs in the water?


Nearly 60 percent of the pupils failed to see these hoaxes. Only 71% per- 
cent of the pupils could solve correctly time and space problems such as, 
“My watch gains 4 minutes a day. If I set it right at noon Monday, what 
time will it be the middle of next week when it is 6 P. M. by the right 
time?” 

Only 164% percent filled out correctly “The minute hand goes ____
times as fast as the hour hand.” Twenty-eight percent solved, “A thirty-five-year-old
man is 7 times as old as his son. How many times as old as
his son will the father be in 25 years?” The problem “A man walks 6 
kilometers per hour from B. Two hours later another man starts after him 
on a bicycle and rides at the rate of 10 kilometers per hour. How long 
must the bicyclist ride to overtake the man who is walking?” was solved 
correctly by 15.6 percent of the children. 

Thirty-five percent of the children correctly solved, “A man goes 5 
miles north, then 5 miles east, then 5 miles south, then 5 miles west. How 
far is he then from where he started?” Only 10 percent solved correctly, 
“Along a 90 meter street there were trees standing from beginning to 
end on both sides, 6 meters apart. How many trees stood along the street?” 
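
The review reports only the percentage of pupils who succeeded on these problems, not the answers. As a check on the arithmetic, the sketch below works five of them in Python under the plainest reading of each statement (flat ground for the walk, a tree at each end of the street). The function names are hypothetical, and the answers are computed here rather than quoted from Révész.

```python
import math
from fractions import Fraction


def minute_to_hour_hand_ratio():
    """Minute hand: one turn per hour. Hour hand: one turn per 12 hours."""
    return Fraction(1, 1) / Fraction(1, 12)


def age_ratio_later(father_age, times_son_age, years_later):
    """How many times the son's age the father will be after years_later."""
    son_age = Fraction(father_age, times_son_age)
    return Fraction(father_age + years_later) / (son_age + years_later)


def hours_to_overtake(walker_kmh, cyclist_kmh, head_start_hours):
    """Riding time needed to close the walker's head start."""
    head_start_km = walker_kmh * head_start_hours
    return Fraction(head_start_km, cyclist_kmh - walker_kmh)


def distance_after_square_walk(leg):
    """Net displacement after equal legs north, east, south, and west."""
    steps = [(0, leg), (leg, 0), (0, -leg), (-leg, 0)]
    east = sum(dx for dx, _ in steps)
    north = sum(dy for _, dy in steps)
    return math.hypot(east, north)


def trees_along_street(length_m, spacing_m, sides=2):
    """Trees from beginning to end: one more tree than gaps on each side."""
    per_side = length_m // spacing_m + 1
    return per_side * sides


if __name__ == "__main__":
    print(minute_to_hour_hand_ratio())      # 12
    print(age_ratio_later(35, 7, 25))       # 2
    print(hours_to_overtake(6, 10, 2))      # 3
    print(distance_after_square_walk(5))    # 0.0
    print(trees_along_street(90, 6))        # 32
```

The tree problem shows the kind of slip the low success rates suggest: dividing 90 by 6 gives 15 gaps, but there are 16 trees on each side, and the street has two sides.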

The following are listed as easy examples that first-year pupils ought
to solve. The percents following in each case indicate how many pupils 
really solved them. 


1. There are 40 nuts in a saucer. After 5 men have eaten 7 nuts each, how many 
nuts will remain in the saucer?—50.3 percent. 

2. A boy measured a rope with a meter-stick and found that it was 6 meters long. 
But someone had secretly cut off one centimeter from the meter-stick. How long was 
the rope actually?—31.5 percent. 


3. If 1% kilograms of cocoa costs 3 RM. (reichs marks) how much does one kilo- 
gram cost?—83 percent. 

4. I divided five marks between two girls. One got 1% times as much as the other. 
How much did the girl get who got most?—53.1 percent. 

5. How many apples can I buy for 1.20 RM. at the rate of 4 for 15 pennies? (one 
RM. equals 2% pennies) —22.9 percent. 

6. A band box and the key for it cost together 2.20 RM. The box cost 2 RM. more 
than the key. How much did the key cost?—77 percent. 


Like many other investigators the author speaks of the errors as being 
surprising and almost unintelligible. 

As in America, many children failed to read all of their problems. This 
error caused 47.3 percent of the pupils to fail on the problem, “The foot of 
a hill is 200 meters above sea level. The top of the hill is 400 meters above 
sea level. How far above sea level is a house that stands half-way up the 
hill?” 

Inexperience with space relations caused trouble with the following 
problems. The percents show the number of correct solutions. 


1. We have 60 centimeters of iron wire. Out of it we make five triangles of equal
size and equal sides. How long is each side of such a triangle? —56.7 percent. 

2. A rectangle is twice as long as it is broad. Its area is 200 square decimeters. 
How long is it?—23.4 percent. 

3. Around a rectangular flower bed 3 meters long and 2 meters wide there is a 
grass fringe % meter wide. What is the area of the fringe?—10.3 percent. 

4. The outside dimensions of a rectangular mirror are 15 decimeters long and 8 
decimeters wide. The mirror has a frame 1% decimeters wide. What is the area of 
the mirror inside of the frame?—26.5 percent. 


The tests also included material designed to test mathematical under- 
standing and insight. 

Schmidberger (169) used arithmetical exercises with 1,121 boys and 
1,163 girls in and about Jena to test sex differences. Seemann (170) pre- 
sented an extensive study of the psychology of number and of errors in 
calculation. This is in all probability the most complete study of errors 
with whole numbers that has been published in any language. The author 
classified errors as mechanical, associative, and functional. Mechanical
errors were interpreted as usual. Associative errors were due to similarity 
of sound, similarity of figure, and perseveration. Functional errors were 
those of operation, logic, and harmful transfer. The author concluded 
that errors are never accidental. He gave a very detailed presentation of
errors in the four fundamental processes. 

E. Thomson (194, 189) used arithmetic exercises as a part of a battery
of tests to study the efficiency of individual work. 

Spelling—Six references relate to the testing of spelling. Brandicourt 
(27) used tests which require the subject to write the proper names under 
drawings. Duthil (49, 56) discussed methods of testing spelling. The 
results of tests in spelling were reported by E. Thomson (189). 

Language—Four articles were found which dealt with language testing.
Duthil (53) was concerned with the measurement of ability in composi- 
tion. Williams (201) reviewed the North Hampton Composition Scale. 
Bobertag has a language test (101). 

Reading—Four studies are also presented which involve the testing of
reading. K. Stern (182) was interested in beginning reading. E. Thomson 
(189) reported results with Haggerty’s reading tests. Thorner (192) pre- 
sented a rather extensive study of the psychology of reading. The content 
of the test was partly sense and partly nonsense material. Reading difficul- 
ties were treated in terms of size of type, vocabulary, length of words, 
and word form. The data were collected by means of the tachistoscope. 

Miscellaneous—Bradford (26) presented a study of perspective in 
geography. Eaglesham (58) used content relating to the geography of 
Australia in studying retention. Eggink and Bradford (60) presented fur- 
ther arguments concerning the testing of perspective in geography. Fessard 
and Fessard (72) discussed musical aptitude as shown by the Seashore 
records. Mainwaring (111) presented his tests of musical ability. Simon 
(175) presented a discussion of group tests in drawing. Theiss (188) 
tested the relation of handwriting to the character of the writer. Leopold 
(102) presented tests of the appreciation of poetry. Brandicourt (27) 
tested vocabulary. Robson (159) presented a study of vocabulary burden 
in the first year of French study. She reported that “children with an 
average age of about 12 and who spend two hours weekly at French are 
able to deal with a vocabulary of from 450 to 812 words during the first 
year. Within this range they succeed in correctly recognizing an average 
of 71% of the words presented but can actually reproduce only 58% of 
them.” The study is both extensive and thorough. 


Examinations and Marks 


There is considerable interest in the form of examinations. Champneys 
(32) presented a bibliography of studies relating to the English examina- 
tion system. Emmett (64), G. Thomson (190), and Wilson (202, 203) 
were concerned with a tetrad criterion for use in examinations. Farmer 
(65) presented data on the predictive value of examinations. Kesselring 
(90) has a study of entrance examinations, especially for teacher-training 
institutions. He investigated the possibility of substituting newer test forms 
for the present state examination in Germany. He suggested tests of memory, 
attention, and the higher intellectual processes. He presented in detail the 
nature of the tests that he would have used. He recommended that new 
tests be used instead of the present examinations. Valentine and Emmett
(197) were concerned with the reliability of examinations. Piéron (138) 
and Sandon (165) presented criticism of examination methods. 

Relatively little is being written concerning the use of teachers’ marks 
in testing. Braun (28) and Sandon (166) based their studies on teachers’ 
marks. Sandon (164) also presented a study of marking systems. 


Aptitude and Character 


Tests of aptitude are not within the range of this number, yet it seems 
desirable to call the reader’s attention to a few studies in that field. 
Claparède (33) has a book on discovering aptitude among school pupils,
and De Montlebert (45) on determining aptitude by tests. Duthil (51) 
also presented a study of tests of aptitude. Fessard and Fessard (72) were 
interested in musical aptitude. Fessard (70) discussed the interpretation 
of numerical results in aptitude examinations. Lahy (97) reported re- 
sults obtained with the Stenquist Test of Mechanical Aptitude. Mennens 
(117) has a study of mental aptitude among prisoners. 

Some interest has been shown in tests of moral character. Ekenberg 
(61) presented a review of the character tests published in America, 
France, and Germany. Henning (82) presented new apparatus and methods 
for use in testing character. Moers (120, 121) developed tests of moral 
understanding, and Reynier (158), tests of character. 


Physical Tests 


There was also interest in tests of movement. Liefmann (104) studied 
the relation between mental and bodily achievement. Her tests of physical 
performance include such activities as stringing beads, rope climbing, bal- 
ancing, rowing, jumping, and tests of suppleness, speed, and muscular 
strength in arms and legs. Most of these tests were adapted from those of 
Serebrowskoja (172, 173) of Moscow. Stephenson (181) was interested 
in tests of motor perseveration. 


Diagnosis, Prognosis, and School Readiness 


Quite a number of workers claimed diagnostic value for their productions.
Brandicourt (27) presented drawings as a basis of diagnosis. Claparède
(33) has much to say on the subject. Kovarsky (92, 93, 94, 95, 96) described
the use of the psychological profile for diagnostic purposes. Rohrschach
(160) discussed the methods and results of psychological diagnosis.

Only three workers made claims as to tests having prognostic values. 
Bobertag (21) and Lietzmann (105) discussed the comparative prognostic 
value of tests and school reports. Farmer (65) studied the “predictive value 
of examinations and psychological tests in skilled trades.” 

Tests of school readiness are usually of the mental type. Descriptions of 
a few of them by Bühler and Hetzer (29), Remy (155), Winkler (204),
and others (187) are noted here as samples. 


Orientation and Social Participation


The use of tests for orientation and guidance has attracted some atten- 
tion abroad. Ménessier (116) described the use of tests of professional 
orientation. Similar discussions were presented by the Piérons (139, 144, 
147). 

Saluschny (163) reported an interesting study of the measurement of 
social participation among groups of school children. The experiment was 
carried on in Charkow, Ukraine. The tests consisted of eight tasks. Groups 
were required to choose a representative to participate in the school con- 
ference of the city. They were also asked to choose a director for the ar- 
rangement of the schoolroom, to sketch a plan for community work, such 
as a school picnic or excursion, to select the pupils best suited for each 
duty, and to pick out the five members of the group who in the opinion 
of the group exercise the most disorganizing influence. The plan of scoring 
is explained. The coefficients of reliability ran as high as .98 ± .006. The
test was given to 8,000 children. 


Psychological Tests 


Owing to their organismic views it is natural to expect foreign workers 
to use psychological tests for educational purposes. Abadi (15) reported 
a test of personality. Abrahamson (16) and Bartsch (18) presented tests 
of imagination. Abrahamson (16), Bartsch (18), and Monchamps (122) 
were concerned with tests of observation. Bartsch (18), Gamsa and Salkind 
(79), Mme. Piéron (146), and Weigl (199) reported tests of attention. 
Bartsch (18), Fischler and Ullert (73), Mme. Piéron (147, 149), and 
Kesselring (90) used tests of memory. Dilger (46) discussed classifica- 
tion by means of the Gaussian curve. Eder (59) undertook the diagnosis 
of carelessness. He assumed that LM = SM × K, where LM is the best 
possible performance of an individual, SM is the worst possible, and K 
is carefulness. Letting LN equal the performance under normal conditions 
and substituting, the author arrives at the formula LN = SN × K. He 
considered the ratio obtained by solving this formula for K a carefulness 
quotient. It is interesting to note that the author’s assumptions are 
related as factors in a product rather than as addends. 

Kennedy (88) found that the Downey Will Temperament tests had low 
reliability. Mira (119) reported a new test for the exploration of affec- 
tivity. Mme. Piéron (145) described what she called psychotechnical re- 
searches in the school. 

Perseveration—Cattell (31) reported p-tests of perseveration. Rangachar 
(154) tested perseveration among English and Jewish boys. The advantage 
was with the Jews. Stephenson (181) described tests of perseveration and 
offered extended discussion of the p-factor. 

Fatigue—Foucault (76) wrote on mental work and fatigue. Lindbeck 
(107) treated work in written calculation in terms of work curves. 

Practice—Lammermann (99) reported a study of the practice effect 
upon intelligence scores. He obtained practice quotients by dividing end 
performance by initial performance. His quotients ranged from .63 to 
1.26. He found the practice effect so marked as to cast serious doubt upon 
the diagnostic value of the tests. Kern and Lindow (89) presented a mathe- 
matical treatment of practice curves. They used the technic of curve-fitting 
and derived y = a log x + b as the general equation for practice. They 
described the calculation of the constants a and b. 
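
The two computations just described can be illustrated with a short sketch. It forms a practice quotient (final performance divided by initial performance) and estimates the constants a and b in y = a log x + b. The use of ordinary least squares, the function names, and the sample scores are assumptions made for illustration; the review does not say how Kern and Lindow calculated the constants.

```python
import math


def practice_quotient(initial_score, end_score):
    """Final performance divided by initial performance."""
    if initial_score == 0:
        raise ValueError("initial score must be non-zero")
    return end_score / initial_score


def fit_log_curve(trials, scores):
    """Estimate a and b in y = a*log(x) + b.

    Ordinary least squares on log(x) is an assumed method, chosen
    only because it is the simplest way to fit this form.
    """
    u = [math.log(x) for x in trials]
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(scores) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, scores))
    a = sxy / sxx
    b = mean_y - a * mean_u
    return a, b


# Hypothetical scores generated from y = 3 log x + 10, so the fit
# should recover a = 3 and b = 10 (up to rounding error).
trials = [1, 2, 3, 4, 5]
scores = [3.0 * math.log(x) + 10.0 for x in trials]
a, b = fit_log_curve(trials, scores)
print(a, b)

# A hypothetical pupil who rises from 50 to 63 points has a practice
# quotient at the upper end of the range Lammermann reported.
print(practice_quotient(50, 63))
```

A quotient below 1.00, such as the .63 at the lower end of the reported range, would indicate a final score lower than the initial one.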

Reasoning—Miiller (125, 126) made extensive and careful studies of 
reasoning among school children. The subjects were aged from six to 
eighteen. They were required to complete syllogisms by supplying con- 
clusions. The syllogisms contained space, time, larger-smaller, equiva- 
lent, whole-part, genus-species, symbolical (algebraic), negative, hypo- 
thetical, disjunctive, and causal relations. Individual reactions to the 
syllogisms were recorded and age norms were derived. These articles are 
of interest to all who are concerned with the task of teaching children to 
think. 

Weigl (199) used tests of logical thinking. The tests of arithmetical 
reasoning by Révész (157) have been described previously under arith- 
metic tests. 

Selection of pupils—Considerable interest was shown in the use of tests 
for purposes of selection. Emery (63) dealt with the social value of selec- 
tion. Peyraube (136) discussed German methods of school selection. 
Schlotte (168) reported tests for the selection of the poorly talented. 
Wellens (200) discussed the selection of the élite. 

Summary 

The purpose of this chapter has been to supply information concerning 
the uses of tests and measurements in China, England, France, and Ger- 
many. There is plenty of evidence to show that educational tests are in 
process of development in China. The Chinese movement is the result of 
American influence. 

The tests of England, France, and Germany have also been stimulated 
in part by American influence, but they also show a point of view dis- 
tinctly their own. In England, statistics is still a dominant interest. France 
is interested chiefly in intelligence tests, while Germany has specialized in 
aptitude testing. In Europe educational tests are used for the most part 
as aids in research. Many of the foreign studies are more extensive and 
thorough than those in this country. American workers can derive much 
benefit by familiarizing themselves with the methods and point of view of 
their fellow workers in foreign countries. 


CHAPTER II 


Present Tendencies in the Uses of Educational 
Measurements 


THAT THE TESTING MOVEMENT of this country has experienced phenome- 
nal growth since 1900 is now the common knowledge of all educators. 
A study of this growth shows a fortunate decline recently in the output of 
materials for which, in many instances, the motive was to share in the 
commercial returns of an innovation. The testing movement is now entering 
upon a period of slower but much more significant and substantial growth. 
The tests published and the uses to which they are being put are the 
result of more precise and detailed investigation. Valuable experimenta- 
tion involving the construction and use of tests is being conducted. Testing, 
removed from the mysterious category in which it was formerly placed by 
conservative educators and laymen alike, is coming to be regarded as an 
integral part of any educational program. There are those, of course, who 
look with skepticism at some of the applications of tests, but on the whole, 
tests and testing continue to flourish. The testing movement has made and 
continues to make lasting contributions to education at all levels. 

Perhaps one of the main reasons for this very wholesome attitude toward 
measurement is that proposed by Lincoln and Workman (232). They say 
that, at first, testing was largely for the expert, since he alone knew the 
field, but now tests are devised for use and interpretation by teachers. Many 
books written in non-technical language are available on educational 
measurements, and teacher-training institutions deal with them as a regular 
part of the curriculum. Such textbooks on supervision as those by Burton 
(210), Carpenter and Rufi (211), Garrison (221), Monroe and Streitz 
(240), and others include testing as an orthodox procedure in instruction. 
Other texts such as those by Brueckner and Melby (209), M. Monroe (239), 
and Gates (222) make all remedial instruction absolutely dependent upon 
testing. Some of the larger cities maintain bureaus of research whose 
primary purpose is to deal with problems of measurement. 

One may well wonder in what direction we are going and demand proof 
that tests and testing are valuable and becoming indispensable. An excel- 
lent discussion of the “pros” and “cons” of testing in the social studies 
which may well be applied generally was presented by Kelley and Krey 
(230). The discussion dealt with such topics as the relationship of testing 
and instruction, intelligence tests and homogeneous grouping, reliability 
and validity of measuring instruments, function and limitation of different 
types of tests, aspects of instruction and achievement measured by objective 
tests, relationship between objective tests and important aspects of educa- 
tion, relationship between individual differences and tests in the social 
studies, experimental method, and use of tests for appraisal in educational 
guidance and vocational counselling. 

Along with the “slower but surer” progress of testing has come the 
analytical criticism of specialists in the field who realize that testing is 
not a cure-all. McConn (235) suggested that one of the reasons why edu- 
cators have thus far failed to make continued and systematic use of tests 
is that there is a lack of comparable forms of examinations. The Cooperative 
Test Service is seeking to meet this criticism. McConn (234, 236) further 
criticized the much-repeated error “of assuming that results of any single 
examination or single set of examinations can be regarded as definite, 
conclusive, and valid as bases for crucial administrative action or crucial 
administrative advisement.” Wood (268:6), in an article on basic consider- 
ations in educational testing, stated that “the chief defect in the testing 
movement has been the neglect of building an adequate philosophy and 
system of using test results for effective and constructive educational guid- 
ance in the largest sense of that word.” Limitations and criticisms of mental 
testing suggested by Horn (227) include such statements as the following: 
(a) the movement is still in a formative stage; (b) mental ability is still 
too exclusively ascribed to determined forces; (c) misconceptions exist 
concerning the degree to which intelligence, as measured by tests, enters 
into achievement; (d) it is becoming increasingly evident that the prog- 
nosis of scholarly achievement may be better accomplished by special 
aptitude and achievement tests which concentrate on defined areas of 
subjectmatter; and (e) it is too often forgotten that there are areas of per- 
sonality and social achievement in which the ingredients are not limited to 
intelligence. Concerning objective tests, Horn’s chief criticism dealt with 
the placing of a premium on rote learning and a consequent failure to 
measure the higher thought processes. However, he stated further that 
“there is no evidence that these defects are peculiarly characteristic of 
objective tests” and that “the objective examination is demonstrably supe- 
rior to the traditional examination.” 


State and National Testing Programs 


Serving as conclusive evidence of the growth of the importance of meas- 
urement in education are the state and national testing programs which 
exist at the present time. 

A study by the United States Office of Education (255) showed that in 
1933 there were state testing programs in twenty-three states of the Union. 
Segel (253) described one which is especially worthy of note—the aca- 
demic contest carried on in Iowa and sponsored by the University of Iowa. 
All schools are invited to participate and a great many of them do. Each 
school entering on a competitive basis agrees to give the tests to all students 
taking the subjects listed. The papers are scored by the schools and sent 
to the contest director who sends back reports which enable each school to 
compare its results with those of other schools as well as to compare its own 
pupils with each other. These test results are also used to determine school 
marks, to guide individuals, and to ascertain the general level of attainment. 
Most of the state testing programs are under the direct supervision of the 
state university or state department of public instruction. 

In the bulletin of the Office of Education (255) referred to above and in 
a study by Chase (215) the following organizations are listed as conducting 
national testing programs: the Educational Records Bureau, the Cooper- 
ative Test Service, the College Entrance Board, Kansas Nation-Wide Every 
Pupil Contest, and the American Council on Education. 


The Educational Records Bureau conducts an annual testing program for inde- 
pendent schools. Its membership has grown from twelve schools in 1927 to over two 
hundred at the present time. In 1933 it organized a program for public schools under 
the direction of a Public School Supervisory Committee. Its chief purposes are (a) to 
score tests and report results and (b) to act as a coordinating center for testing so that 
results on different tests may be comparable. 

The Cooperative Test Service is an organization designed to construct tests. It has 
been subsidized by the General Education Board for a period of ten years to construct 
ten or more comparable forms of examinations in the subjectmatter of the junior and 
senior high-school levels. This Service cooperates with other agencies such as the 
Educational Records Bureau and state associations by using their tests in national 
and state high-school testing. For example, the state high-school testing program under 
the auspices of the Association of Minnesota Colleges has used the tests of the Cooper- 
ative Test Service. Tests are constructed in such subjects as the Romance languages, 
mathematics, science, and social science. 

The College Entrance Board was made up in 1930-31 of fifty members, thirty-nine 
representing universities, colleges, and scientific schools, and eleven representing sec- 
ondary schools. It carries on a program of testing with the graduating seniors both 
in the high-school academic subjects and in scholastic aptitude. These results are 
acceptable to all universities and colleges in the United States. A report (255) of 
1932 shows that 19,929 students took one or more of these tests. The examinations are 
prepared by the Board itself and arrangements for giving and appointments for 
scoring are made by it. Committees for scoring papers are composed of representatives 
of many different colleges and universities and secondary schools. All work is done 
independently of any one institution. 

The Kansas State Teachers College Bureau of Educational Measurement supervises 
a nationwide scholarship test twice each year. The tests are constructed in the 
Kansas high schools under the direction of the Bureau. Participating schools score 
their own tests and send tabulations to the Bureau of Measurements which makes a 
report containing a national norm and the state norms. 

The American Council on Education annually sponsors the construction of a new 
edition of the American Council Psychological Examination. The examination is 
widely used with high-school seniors and college freshmen. An edition has recently 
been constructed for use with high-school sophomores. 

The Psychological Corporation has conducted a national survey of English usage 
which has been made a state program in Pennsylvania and Ohio. 


“Cooperative” testing as such is not a recent development. Certain states 
conducted state high-school examinations many years before the stand- 
ardized testing movement. The gain in favor of this type of testing is due 
to the distinct advantage it affords over the individual school testing 
programs. Douglass (216) listed among its advantages that com- 
parable scores are available for a whole area, making possible a better 
understanding of achievement, and that this comparison provides motiva- 
tion for pupils and teachers as well as serving as a means of classification. 
A further step in the direction of providing standards for comparison, 
reported by Segel (254), is being carried on under the supervision of the 
United States Office of Education in an attempt to establish equivalent 
scores among different intelligence tests. 


Trends in Test Construction and Use 


There are various methods of classification of the uses of tests. For pur- 
poses of brief review the following outline will prove helpful: 


1. Prognosis 
2. Surveys 
3. Diagnosis and remedial treatment 
4. Instruction, grouping, and marking 
5. Experimentation and research 
6. Guidance. 


Prognosis—Some valuable contributions to measurement are made by 
those who would predict the success of an individual in a given field. Fritz 
(220) recently attempted to determine the value of certain tests for pre- 
dicting college marks and teaching success. He found two tests (Aptitude 
Test for Elementary High-School Teachers and American Council Psycho- 
logical Examination) which he considered helpful but by no means depend- 
able in predicting college marks. He also concluded that if the Aptitude 
Test really measures ability to teach school, those students earning the 
better marks are better risks as teachers. Segel and Gerberich (252) studied 
the possible use of the American Council Psychological Examination in 
differentiating ability to do college work and ability to do work in specific 
subjects. They concluded that the test should not be used for differential 
prediction purposes when college marks are the criteria. Finch and Nemzek 
(219) studied the prediction of college achievement from data obtained at 
the beginning and end of the secondary-school period. The authors con- 
cluded that “it is quite improbable that a combination of marks from a 
wide variety of schools will afford a satisfactory means of predicting further 
achievement until marking systems are decidedly improved. The present 
results do demonstrate that it is possible under favorable conditions to 
obtain from marks assigned by high-school teachers a predictive measure of 
distinct value.” A worthwhile evaluation of the prognostic value of different 
types of tests in educational psychology was made by Terry (260). The 
Van Wagenen Reading Scale in Educational Psychology, Form A, was 
compared with the Iowa Silent Reading Test and with the Otis Group 
Intelligence Scale in an effort to determine the relative value of the tests 
for predicting ability in educational psychology and in teaching. The 
criteria of achievement were an objective test, an essay test, and the objective 
and essay test combined in a single score. The subjects were the members of 
two classes in educational psychology attending summer school at the 
University of Alabama. The findings include the generalizations that (a) 
the Van Wagenen Test is less dependent upon the factor of intelligence 
than the reading tests considered; (b) there is a considerable degree of 
validity in the Van Wagenen Scale and it may be a better means of pre- 
dicting achievement in educational psychology than tests of general read- 
ing ability; and (c) the value of the Otis Test as a means of predicting per- 
formance should not be overlooked, although a combination of the Otis 
and Van Wagenen yields a higher correlation with achievement than either 
test taken alone. A somewhat similar study was made by Torgerson and 
Aamodt (261) in which the purpose was to determine the validity of two 
aptitude tests in algebra as compared with an intelligence test. The tests 
chosen were the Lee Test of Algebraic Ability, the Orleans Algebra Prog- 
nosis Test, and the Otis Self-Administering Test of Mental Ability, Higher 
Examination. The tests were administered to 236 ninth-grade pupils in 
Muskegon, Michigan. All three tests were found to be about equally valid 
and effective in predicting grades in algebra. The aptitude tests were about 
equally efficient in setting up a critical score below which the students’ 
chances for success were slight. The sharpest discrimination was made 
by the intelligence test, as 22 of the 23 pupils with intelligence quotients 
below 90 failed in algebra at the end of the year. 
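
The idea of a critical score can be made concrete with a small sketch that computes the proportion of failures among pupils who fall below a chosen cut-off. Only the figure of 22 failures among 23 pupils below an intelligence quotient of 90 comes from the study as reported above; the individual scores, the remaining pupils, and the function name are hypothetical and serve only to reproduce that ratio.

```python
def failure_rate_below(records, critical_score):
    """Proportion of pupils scoring below critical_score who failed.

    records is a list of (score, failed) pairs; returns None when no
    pupil falls below the critical score.
    """
    below = [failed for score, failed in records if score < critical_score]
    if not below:
        return None
    return sum(below) / len(below)


# Hypothetical records: 23 pupils below 90, of whom 22 failed,
# plus a few invented pupils above the cut-off.
records = (
    [(85, True)] * 22
    + [(88, False)]
    + [(105, False)] * 10
    + [(95, True)] * 2
)

rate = failure_rate_below(records, 90)
print(round(rate, 3))  # 22/23, about 0.957
```

On these figures about 96 percent of the pupils below the critical score failed, which is what the text means by a sharp discrimination.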

Another important type of prognostic test is the reading readiness test. 
Segel (254) said, “The problem of accurately measuring the readiness of 
children to read has become more important since the advent of primary 
curriculums which allow for considerable variation in the time for begin- 
ning reading of different pupils. These tests have been found to have scores 
which have a fairly respectable relationship to later achievement in reading 
or other first-grade school achievement.” A study dealing with informa- 
tional background of kindergarten children as it affects reading readiness 
was reported by Troxel (262). Seventy-four kindergarten children were 
selected according to teachers’ classification of background as good or poor. 
Data were obtained through the health clinic, teachers’ or principals’ knowl- 
edge of the home, questionnaire filled in by parents, and information fur- 
nished by the visiting nurse. The summary of the study stated: 


The study of the questionnaire brings out facts which seem to confirm the importance 
of a rich informational background in reading readiness. The rich background children 
as judged by the kindergarten teachers of Kalamazoo, live in smaller families, have 
had more opportunities for travel, have richer play experiences, and are surrounded 
by a reading environment far richer than the children having a meager background. 
The study seems to justify the conclusion that the technic used in determining richness 
of background, or a part of it, could be used in mapping out instruction procedure, 
in grouping children, in enriching the curriculum, and in understanding some of 
the underlying causes of failure in learning to read. 


Surveys—This use of tests is the one with which the state and national 
testing programs are largely concerned. A survey is conducted primarily 
for the purpose of comparing schools with schools, cities with cities, states 
with states, etc. In other words, such a program is concerned with group 
results only. Douglass (216) discussed the effect of state and national 
testing on secondary schools. The chief objections relate to the artificial 
determination of objectives and methods, the fact that all instructional 
activities are directed toward cramming things measured by examinations, 
and the overemphasis on traditional subjectmatter. Advantages which 
Douglass suggested have to do with facilitation of comparison, better basis 
for school marks, motivation for teachers as well as pupils, training for 
pupils in test taking, and the stimulation of teachers’ interest in objective 
tests. Lincoln and Workman (232) stressed the fact that results of survey 
tests are not individual performances but group averages. They did, how- 
ever, classify the use of tests to single out individuals of the group for 
further intensive study under the survey heading. Horn (227) emphasized 
the use of caution in the interpretation of survey tests in the evaluation of 
instruction by reminding (a) that any test measures only a minor part 
of the total results of instruction, and (b) that relative performance is 
dependent on a large number of factors of which the result of immediate 
instruction is but one. Other references to the survey test were made by 
Segel (253, 255), Johnston (229), and Allen (205). 

Diagnosis and remedial treatment—The use of tests for diagnostic pur- 
poses is of decidedly growing importance. Supplementary to this usage 
is remedial instruction and general improvement of instruction. A diag- 
nostic test in any subject reveals which skills have been sufficiently mastered 
and which require further instruction. Ramsay (247) reported the results 
of an investigation dealing with diagnostic testing and remedial teaching 
in the junior-senior high school. His conclusions were three: (a) teachers 
need not hesitate to undertake diagnostic testing and remedial instruction 
even with the limited facilities of the small junior-senior high school; (b) 
large gains are possible by deficient pupils in the ability to study, in the 
mastery of fundamentals of reading, and in the acquiring of the funda- 
mental concepts of arithmetic; and (c) remedial instruction, as evaluated 
by teacher judgment, carries over to the instruction of the regular classroom. 

The need for diagnostic testing was discussed by Olander (245) who 
pointed out the poor ability of teachers to diagnose pupils’ errors. He re- 
ported an experiment in which arithmetic tests were given to pupils indi- 
vidually. They were required to make their calculations aloud, and these 
were recorded verbatim. Two papers, representing four skills each, were 
mimeographed and corrected by forty teachers who had no knowledge of 
the verbatim record of calculations. Later, these records were made avail- 
able and the teachers were asked to correct the papers in the light of their 
knowledge of the pupils’ methods of work. It was found that the average 
percent score was lowered by 6.6. The teachers were asked to diagnose the 
errors before they knew the pupils’ methods of work. Results showed that 
errors in long division were diagnosed with a high degree of accuracy, but 
the average in the diagnosis of the three other fundamental skills—addition, 
subtraction, and multiplication—was less than one error out of three cor- 
rectly diagnosed. 

Carter (212) reported a study illustrating the diagnostic testing and 
remedial procedure involved in a case of reading disability. Such topics are 
discussed as (a) innate ability and factors affecting performance, (b) 
school status, (c) initial status in reading, (d) diagnosis and suggestions 
for remedial work, (e) remedial treatment, and (f) results of remedial 
instruction. 

Smeltzer (257) discussed an efficient method of giving and scoring 
objective tests for diagnosis in the classroom. Seventy-five test items placed 
on three mimeographed sheets were distributed to pupils. The numbers 
from one to seventy-five were written on the board by the teacher. When 
three-fourths of the pupils had finished, the papers for each row, without 
names, were collected, and redistributed in another row. Incorrect and 
omitted answers were checked. Then by counting raised hands, the number 
of students who made mistakes on each question was recorded on the 
board. Scores were entered on test papers, the papers returned to the 
owners, and the names written on them. 
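
The tallying step in this procedure can be sketched in a few lines. The sketch is illustrative only: the representation of each checked paper as a set of missed item numbers, the function names, the sample papers, and the scoring rule (number of items right out of seventy-five) are assumptions, since the review does not say how the scores were computed.

```python
from collections import Counter

NUM_ITEMS = 75  # the test described above had seventy-five items


def tally_misses(papers):
    """Count, for each item, how many papers missed it.

    This mirrors the count of raised hands written on the board
    beside each item number.
    """
    counts = Counter()
    for missed in papers:
        counts.update(missed)
    return [counts.get(item, 0) for item in range(1, NUM_ITEMS + 1)]


def paper_scores(papers):
    """Assumed scoring rule: items right = total items minus items missed."""
    return [NUM_ITEMS - len(missed) for missed in papers]


# Hypothetical checked papers: each set holds the numbers of the
# incorrect or omitted items on one paper.
papers = [{3, 10}, {3}, set(), {3, 10, 75}]

misses = tally_misses(papers)
print(misses[2], misses[9], misses[74])  # misses on items 3, 10, and 75
print(paper_scores(papers))
```

Items with the largest counts are the ones a teacher would presumably take up first in remedial work.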

A study reported by Whitmer (266) evaluated the progress made by 
college probationers in the years succeeding remedial work. He concluded 
that there is a probability that, with assistance, the probation student 
reaches his level of academic performance more quickly than he would 
without such help. 

Practically any good book dealing with tests and diagnosis is replete 
with case studies and discussions of diagnostic technic. 

Instruction, grouping, and marking—Besides motivating individual 
remedial instruction, testing is used rather widely in homogeneous or abil- 
ity grouping. Tests of intelligence are used mainly for this purpose. There 
has been much discussion of the value of this use of tests with none too 
clear and definite conclusion as to whether or not its advantages outweigh 
its disadvantages. A study was reported by Sauvain (250) showing the 
relative attitudes of parents, teachers, and school officials toward this 
method. He made the observation that the trend seems to be away from 
grouping. His study revealed that teachers and principals favor grouping 
more than do parents, and only those parents whose children are in the 
lower groups regard the system with real disfavor. From the standpoint 
of the teacher, grouping is a great aid in instruction. 

The Tulsa experiment (263) provides that two types of test be given to 
children at the completion of the second grade, type 1 to divide the rapid 
from the less rapid learners, and type 2 to test reading, arithmetic, writing, 
and spelling. The rapid learners are then divided for each subject on the 
basis of achievement. Achievement testing and regrouping take place at 
the end of each semester. The child does not know whether he is in a 
higher or a lower group; he is promoted regardless of his classification. 
Ten or more levels in each of the tool subjects are provided and reclassifica- 
tion is easy. The age of entrance to junior high school is the same, under 
this plan, as under an ordinary graded system. 

O’Shea (246) suggested an improved method of administration in the 
elementary school wherein intelligence group tests are used to determine 
(a) the mean mental age of the class, (b) who the superior and inferior 
pupils are likely to be, and (c) what the original tentative group should 
be with which later-formed subject groups may be compared for gaining 
valuable information concerning individual abilities, interests, and effort. 
Classifications should include a definite range of intelligence quotient and 
chronological age. 

Woody and others (270) made a study of the effects of measurement on 
instruction. They concluded that “measurement properly conceived is 
an integral part of the complete teaching process and as such becomes an 
important agency in directing the choice of subject matter and methods 
of teaching.” 

An aid to instruction is found in the workbook and instructional tests. 
Van Liew (265) discussed the justification of the workbook and enumer- 
ated standards upon which a justification depends. He stated that the 
workbook can be justified only when the pupil is given something to do 
beyond merely reading and remembering subjectmatter—only when he is 
given something to make him understand, appreciate, think, apply, or 
construct. Some standards upon which the justification of the workbook 
hinges are: let it (a) guide and aid study, (b) present facts with a view to 
significant organization, (c) avoid mere syllabus forms and topical out- 
lines, (d) distinguish between working exercises and practice exercises, 
(e) be constructed with due respect to teaching procedures, (f) parallel a 
particular text, (g) aid teachers in use of good textbook, and (h) introduce 
pupils to serviceable ideas of methods and sources making for independent 
study. 

Maxwell (238) presented arguments for and against the use of the 
workbook as an instructional aid. Arguments for it include the following: 
(a) the workbook develops initiative and independence on part of students; 
(b) it presents material in definite sequence; (c) if outlined to accompany 
a particular text it assures the securing of that point of view and the cov- 
ering of essential materials as intended by the author; and (d) it reduces 
the labor of the teacher. Opposing arguments include: (a) the workbook 
tends to stultify originality because the references are limited to one text; 
(b) it tends to minimize individual differences; (c) it emphasizes the teach- 
ing of the textbook rather than the development of the child; (d) it tends 
to stress memory work; and (e) teachers follow workbooks too slavishly. 

The results of achievement tests are used for marking pupils. This plan 
yields comparable results from group to group, since standardized tests 
provide the norms. Macomber (237) said, “as a measure of academic 
ability, records of standardized achievement tests over a period of years, 
together with records showing the actual attainment of educational goals 
in terms of mastery of units of work, are most meaningful.” The advan- 
tages and disadvantages of marks, together with the existing needs which 
can be taken care of by results of standardized tests were reported by 
Ayer (207). The case against marks shows that marks given by teachers 
are inaccurate and often based upon items other than the actual trait being 
measured; overemphasis is placed upon high marks and the disgrace of 
low marks has been detrimental to pupils; marks put too high a premium 
on the acquisition of subjectmatter and too little upon the improvement 
of the child. The case for marks shows that the best progress is made when 
learners are aware of the rate of their improvement; quantitative marks 
are essential for classification, educational guidance, and research. The 
elimination of marks does not do away with failure but merely covers up 
poor work. Marks should be made more reliable, more specific, and more 
discriminating and should be used as checks or guides rather than rewards 
or punishments. The use of standardized tests may well make marks more 
reliable, specific, and discriminating. 

Experimentation and research—Objective tests, because of their superi- 
ority to other measuring devices, are a great aid to educational experi- 
mentation. With the aid of measurements, the experimenter is able to set 
up groups which are equivalent in such traits as he wishes to control. The 
controlled experiment is most widely used in studies where the effect of a 
single variable is to be measured. Such problems as the value of different 
methods of instruction or of different textbooks, the effect on achievement 
of size of class, etc., are typical of the use to which this method may be put. 
Lincoln and Workman (232) discussed the method, listing several essen- 
tial points, such as (a) wise choice of subjects, (b) careful selection of 
place and time of experiment, (d) control of desired experimental factors, 
and (e) accurate measurement of experimental factors. Several illustrative 
cases of experimental method were presented. 
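The setting up of equivalent groups described above can be sketched in a few lines of Python. This is only an illustrative procedure, not the one given by Lincoln and Workman: the function name `matched_groups`, the use of a single pretest score as the matching trait, and the random assignment within pairs are all assumptions made for the example.

```python
import random


def matched_groups(scores, seed=0):
    """Form two groups matched on one pretest score (hypothetical helper).

    scores: dict mapping a subject identifier to that subject's pretest score.
    Subjects are ranked by score and taken in adjacent pairs; one member of
    each pair goes, at random, to each group.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)
    experimental, control = [], []
    # Step through the ranking two at a time; an unpaired last subject is dropped.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        experimental.append(pair[0])
        control.append(pair[1])
    return experimental, control
```

Pairing adjacent subjects keeps the two groups close in both average and spread on the matched trait; any trait not used for matching is left to chance.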

Robb (248), for example, reported a controlled experiment to determine 
the results of direct and incidental methods of instruction in the field of 
character education. The control group was composed of 182 junior high- 
school pupils and the experimental group of 110 twelfth-graders. Moral 
knowledge and ethical discrimination were measured, showing relatively 
small difference in the two groups. The incidental method, i.e., “purpose- 
ful instruction in the development of character through the creation of 
related experiences closely integrated with the traditional divisions of sub- 
ject matter, and with pupil awareness of such instruction reduced to a 
minimum,” was used with the control group, whereas the direct method, 
“involving a clearly planned course of study in moral instruction,” was 
employed with the experimental group. All comparisons by means of 
rating scales were in favor of the experimental group, even though the
difference was small in some cases. “The superiority of the instructed group 
was greatest on the ethical discrimination test, and least on the ratings 
given to one another by the students themselves.”

Longstreet (233) employed the controlled experiment method in an
experiment with the Thurstone Attitude Scales in which he attempted to 
determine to what degree high-school pupils’ attitudes toward patriotism, 
the United States Constitution, and war are affected by courses in American 
history and civics. He found that high-school pupils’ attitudes are not 
changed unless the instructor makes special effort to effect such changes. 
Segel (256) suggested the use of the controlled experiment to evaluate 
instruction by radio. Such a study, reported by Gordon (224), was carried 
on at the University of Wisconsin to evaluate the effectiveness of the radio 
in the subjects of current events and music. Twenty-five rural schools com-
posed the experimental group and an equal number, chosen by the county
superintendent, made up the control group. All schools were provided with 
the same study materials. The control group was taught by the classroom 
teacher, but in the case of the experimental group, all instruction was 
given by radio. Results of the experiment showed that the “materials con- 
tained in the radio lessons were better taught than without the aid of the 
radio.” 

In the one-group method of investigation, on the other hand, a single 
group is considered. For example, Charles (214) compared the intelligence 
quotients as shown by three different mental tests applied to a group of 
incarcerated delinquent boys. He found that about the same difference exists 
between the result obtained by the Otis and the Kuhlmann-Anderson as 
exists between that obtained by the Otis and Binet-Simon. There is a much 
closer agreement between the Binet-Simon and Kuhlmann-Anderson than 
between either of them and the Otis, which is estimated as yielding results
about ten points too high. Similarly, a study was made by Engle (217) of 
the personality of a group of high-school honor pupils showing that there 
is no single factor which correlates significantly with scholarship in this 
group. With the aid of this one-group method, Wrightstone (271) made a 
study of the correlation among tests of high-school subjects prepared by 
the Cooperative Test Service. He found that among these tests there is a 
higher correlation than “can be explained by the hypothesis of a general 
factor of abstract verbal intelligence.” 

The correlation formula is often used in this one-group method of experi- 
mentation. This procedure has been employed, for example, by Baker and 
Broom (208) in a study of a criterion for the choice of a primary reading 
test. Data yielded by results of six standardized reading tests administered 
to about ninety students were correlated. The tests were found to validate 
each other very well, as the coefficients of validity ranged from .57 to .84. 
Also, Stagner (258) made a study of the intercorrelation of six objective 
personality tests. Correlations were found to be mostly low, with cases of 
high relationship when identical elements were involved. Moore and Steele
(241) attempted to evaluate six most frequently used tests of personality.
These tests were administered to fifty-eight students at Mount Holyoke. 
General results warrant the inference that “the atomistic method of evalu- 
ating personality seems to have very little promise. . . . An entirely dif- 
ferent method of approach . . . seems necessary.” Jenkins (228) in- 
vestigated the Standard Graduation Examination for Elementary Schools 
as a means of predicting success of pupils in certain high-school subjects. 
This was done by means of correlating test scores on the Standard Gradu- 
ation Examination and high-school grades. Results showed that the total 
score predicted success in social studies better than for other fields; “in 
general, there is a direct proportion between persistence in high school and 
comparative total score on the test.” Seagoe (251) made an evaluation of 
certain intelligence tests by means of this method. In general, it was con- 
cluded that primary tests agree less with upper-grade tests than the latter 
agree with themselves, and the former will correlate less highly with school 
achievement than the latter. 
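The coefficients reported in the studies above are product-moment correlations between two sets of scores earned by the same group. A minimal sketch of the computation follows; the function name `pearson_r` is chosen for illustration, and none of the reviewed studies supplies data, so no figures are reproduced here.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on each test must vary")
    return sxy / sqrt(sxx * syy)
```

Applied to test scores and a criterion such as later school marks, the result is what the studies call a coefficient of validity; applied to every pair among several tests, it yields the table of intercorrelations.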

An excellent evaluation of the validity of certain questions which purport 
to measure neurotic tendencies was given by Landis and Katz (231). The 
Bernreuter Personality Inventory was administered to 184 house patients 
and 40 out-patients of the Psychiatric Institute of Ohio University at Athens, 
either in small groups or singly. All of the subjects were “voluntary” 
patients at the Institute and all were fairly cooperative and interested in 
the test. The authors concluded that “the personality inventory gives re- 
sults indicative of poorly adjusted personality when the individual receives 
a high score; when the individual receives a low score, we are not justified 
in drawing any conclusion concerning the satisfactoriness of his adjust- 
ment to life. On direct validation, we found that about three-fourths of 
the self-descriptive statements given by the neurotic individuals really agree 
with the externally ascertained facts concerning those individuals.” 

On the whole, it may be said that the standardized test has opened an 
avenue in the educational field which before was hardly accessible because 
of the subjective methods of research which necessarily were used. This 
facilitation of research has added greatly to the precision and care with 
which materials are now devised, and may be said to be responsible for a
great part of the changed attitude toward tests and testing. Education is 
gradually becoming a science and its practices are being based upon care- 
fully determined facts. 

Guidance—The topic of guidance is so all-inclusive that it seems wise to 
consider several different aspects of it. Williamson (267) defined educa- 
tional and vocational guidance as “a process of assisting pupils to select 
and become informed about that occupational field which is consonant 
with their demonstrated aptitudes, interests, and experiences, and to secure 
training in line with this choice up to the limit of their educability.” 

In every classroom there are those pupils who do not respond to the 
normal routine. Their difficulties may be caused by obvious factors or by a
subtle maladjustment whose symptoms are not outwardly manifested. For
this reason, guidance is not to be concentrated on vocational and academic
factors alone, but is to include all possible aspects of the individual. For-
tunately, the progress of guidance programs has been facilitated by the
development of tests of personality, attitudes, character, etc., which, al-
though in need of much greater perfection, give promise of real useful-
ness in the future. The use of such scales, together with other data collected
over a long period of time, according to Segel (255) “will furnish better
means of predicting success in school or in different curriculums than we
have hitherto had.” Heck (225) said, “Many attendance problems would
never have been problems if a psychological division had been available
where a child could have been adequately studied and where an academic
program more in line with his capacities, abilities, and interests could
have been provided.” McConn (235) felt that the introduction of a
guidance program should begin in the secondary level because the high
schools serve a much larger group. With this arrangement, the college
guidance program might be carried on as a continuation of that started at
the secondary level with accumulated data available. Evans (218) re-
ported such a guidance experiment carried on in the Pratt, Kansas, High
School over a period of three years. The procedure was as follows: (a) 
the counselor acquired a mass of knowledge about the pupil’s interests from 
observation, interest tests, and personal interviews; (b) the counselor, 
pupil, and parents met at the pupil’s home to plan a schedule for tenth,
eleventh, and twelfth grades and this schedule was filed away on a card. 
The results of such procedure after three years of operation showed: 
(a) the vocational counselor had become valuable to the school; (b) the 
pupil chose electives more wisely; (c) parents became a part of the school
system; (d) the pupil had incentive for doing better school work because 
he had a goal to attain; and (e) the principal was given a source of infor- 
mation which enabled him to calculate the demand for certain subjects and 
plan his class schedules accordingly. Newell (244) discussed the methods 
of child guidance adapted to a public school program. His purpose was 
to give a glimpse of the case study method. He described several representa-
tive studies considered by the psychological clinic.

In summary, he says, “Our usual procedure with a new case is to under- 
take intensive work with school, home, and child for a few weeks. It is 
then often possible to discontinue active treatment, merely keeping the case 
under observation.” In every instance, the accumulation of essential infor- 
mation as a basis of guidance involved the use of tests and measurements. 

Controversy exists concerning the validity and reliability of scales pur- 
porting to measure traits of character. Typical of the attitude of the 
opponents is that expressed by Gillingham (223), “To extend the measure-
ment idea to personality is to attempt to harness the infinite.” The op-
ponents feel that character traits are too elusive and intangible to offer a 
reliable result of measurement and that the tests themselves are invalid.
On the other hand, there are those who are more optimistic concerning the
feasibility of measuring traits of character. Although they fully realize
the shortcomings and imperfections of such scales, they believe that the 
future gives promise of their usefulness. Stephens (259) asserted that “the 
very foundation of all sound guidance programs is built on carefully ar- 
ranged personal data much of which consists of the results of various kinds 
of examinations or tests.” The Seventh Yearbook of the Department of 
Classroom Teachers of the National Education Association (242) pre- 
sented a discussion of personality tests and their uses. Especially note- 
worthy is the bibliography concerning major problems of elementary- 
school children and secondary-school children. We read here, “An effort 
to formulate tests and other devices for the discovery and measurement of 
factors related to personal development and adjustment has gone far 
enough to give promise of real usefulness in the future.” A bulletin of the 
Research Division of the National Education Association (243) stated that 
“although progress is being made, it seems unlikely that the status and 
needs of the whole personality can ever be discovered with the precision 
and economy of time now possible in the measurement of general intelli- 
gence and subjectmatter learning. . . . The foregoing statement does not 
imply that available testing procedures have little value in uncovering 
personality difficulties.” There follows a discussion of various studies of 
personality tests and testing. Segel (254) in speaking of the increasing 
number of attitude scales constructed under the direction of L. L. Thurstone, 
says, “These scales appear to be valuable instruments in the evaluation of 
effects of different environments upon people. These scales will make pos- 
sible a better approach to the measurement of the effect of motion pictures, 
radio, and newspapers.” Other references to personality tests and their 
uses were made by Tyler (264), Heck (225), Horn (227), Lincoln and 
Workman (232), and Woody (270). 

The clinical approach to guidance is cognizant of the individual as a 
composite organism, the study of which must necessarily take into con- 
sideration many different aspects. The dangers of interpreting results of a 
single test as final are pointed out by presentday clinicians, psychologists, 
and psychiatrists. A good description of clinical procedure was presented 
by Carter (213) who summarized the work of the psycho-educational clinic 
at Western State Teachers College, Kalamazoo, Michigan, over a two-year 
period and presented a reading case which he described as being “partially 
adjusted” at the end of the remedial period. Factors considered by the 
clinic in diagnosis were family history, developmental and medical history, 
school history, and clinical data. Faculty members of the Department of 
Education and Psychology acted as counselors for certain students assigned 
to the cases, and, in some instances, cases were handled by faculty members 
alone. Of the sixty children studied and treated at Western State Teachers 
College during this two-year period, 25 percent were designated as “satis- 
factorily adjusted,” 52 percent as “partially adjusted,” and 23 percent as 
showing no improvement whatsoever. Henry (226) also presented a
description of clinical procedure in a review of a behavior problem con-
sidered by this clinic. Robinson (249) discussed the responsibilities of a
child guidance clinic in relation to mental hygiene. He viewed the clinic as a
medical institution whose primary responsibility is to the individual child.
These clinics should participate fully in the mental hygiene movement, ever
recognizing the needs of the community and ever coordinating home and
school. Finally, educators will recognize, through clinical experience, that
non-promotion is a major mental hygiene hazard, and that it may be neces-
sary to sacrifice grade standards to protect the child’s mental health.

Schools for which clinical service is available are indeed fortunate, but 
the lack of such service does not mean that the clinical approach to a 
problem cannot be used. For example, we read in the Seventh Yearbook 
of the Department of Classroom Teachers of the National Education Asso-
ciation (242): “The teacher of today must be more than an instructor. He
must be to some extent an experimental psychologist, a diagnostician, and
a guidance expert. He must be a practicing physician in the realm of per-
sonality, interested both in curative and preventive medicine.”

An essential part of the guidance program is a cumulative record system 
for the purpose of presenting college entrance information, counseling 
during school years and after school years, prediction of personal success, 
aid in choice of vocation, etc. The American Council on Education (206) 
has prepared a form for such a system. This form is discussed by Wood 
(269) who says, “The American Council Committee on Personnel Meth-
ods advocates a method identified by its central and unifying instrumen-
tality—the cumulative record of measures and observations recorded in 
comparable terms in a form convenient for rapid, accurate, and compre- 
hensive interpretation of growth trends, which is to be used carefully and 
continuously by all who take the responsibility of advising pupils.” Sig-
nificant features of the record pointed out by Wood are: (a) it is arranged
in calendar unit columns which facilitates detection and interpretation of 
trends in development and also gives cross sections of the individual’s re- 
corded status at any time; (b) the measures used are comparable from 
year to year giving meaningful indications of growth in any function tested; 
and (c) it provides spaces for objective measures of interests and per- 
sonality traits, for concrete evidence of interests and development, and for 
teacher’s judgment. A shorter form than that of the American Council on
Education has also been prepared by the Educational Records Bureau.


Summary

By way of conclusion, it may be said that the summary of the previous 
pages bearing upon the present tendencies in the uses of educational 
measurements indicates clearly that tests and measurements for their own 
sake are rapidly passing from the educational picture. On the other hand, 
the use of tests as an integral and essential part of educational procedure 
and research is growing. Using tests for what they may contribute to the 
realization of the important aims of education and the solution of educa- 
tional problems appears decidedly to be the modern tendency.


CHAPTER III


Objective Achievement Test Construction 


THE LITERATURE on educational achievement test construction during
recent years has exhibited a noticeable change in character. Before 1930, 
writers on test construction were concerned primarily with the compara- 
tive merits of the so-called “old” and “new” types of tests, with the 
statistical technics employed in the gross evaluation of tests, with the 
preparation, standardization, and statistical evaluation of specific standard- 
ized tests, with the development of new test forms, and with the experi- 
mental determination of comparative validities and reliabilities of objec- 
tive tests of various types or forms. Recently there has been some ten- 
dency, not as marked as might be desired, to recognize that the statistical 
technics ordinarily used in test evaluation are far more fallible and less 
meaningful than had been supposed; that the factors which account for 
differences between types of tests are far less significant than those which 
account for variations in quality between tests of the same external form; 
and that much more can be learned about these latter factors through 
internal analyses of test materials—through qualitative and subjective as 
well as statistical analysis of individual items—than through gross statis- 
tical evaluations or comparisons of complete specific tests. Consequently, 
there has been some diminution in the relative number of articles dealing 
with studies of comparative validity and reliability, particularly of the 
old and new types of tests, and of articles reporting the characteristics 
of specific standardized tests. A much larger proportion of articles than 
formerly has been concerned with item analysis technics, and relatively 
more attention has been given to qualitative and logical, as opposed to 
statistical, considerations in general. 


The Concepts of Validity and Reliability 


The early literature on testing was characterized particularly by much 
loose thinking and by many serious misconceptions concerning the nature 
of test validity and reliability, by undue confidence in the statistical tech- 
nics employed to measure these characteristics, by serious exaggeration 
of the relative significance of the reliability characteristic in test analysis, 
and by what has been described as the “jingle fallacy”—that of naively 
assuming without adequate evidence that a test really measures what its 
name implies that it measures. The literature of the past two or three 
years has contained several discussions which should contribute signifi- 
cantly to a clarification of current thinking in this area. In the opinion of 
the reviewers, each of the first three or four of the discussions reviewed 
in the following paragraphs is deserving of careful study by anyone seri- 
ously interested in the problems of test construction and use.

In discussing the statistical technics that are currently employed in
psychological test work, Thurstone (341) took every opportunity, as he
derived and explained the statistical formulas, to draw attention to the
limitations and abuses of, and to the misconceptions associated with, the
statistical technics considered, and to develop a sound appreciation of the
underlying concepts of validity and reliability. The character of his dis-
cussion is indicated in part by the following quotations from his preface:


Correlational methods have probably stifled scientific imagination as often as they
have been of service. . . . In this country the reliability formulae have become a sort
of fetish rather than a tool. . . . The logic of validity and reliability should be re-
garded as a tool for the investigation of ideas and not as a sort of research pattern
which by itself guarantees scientific respectability. . . . The student should be warned
that while reliability coefficients are juggled as though they really were available,
these coefficients are in themselves more or less in the nature of estimates and make-
shifts.


Turney (343) attempted “to justify a single definition of validity and 
a single criterion for judging validity.” He maintained that “validity is 
by its very nature determinable by no other means (than the judgment of
experts) and the only statistical treatment which is essential to the estab-
lishment of validity is that which will refine or assist in the consensus of
expert opinion.” In a critical analysis of existing concepts he drew specific 
attention to a number of current erroneous notions concerning the nature 
of test validity. This well-written article should do much to secure more 
adequate appreciation of the important fact that the judgment evidenced 
by the test builder in the selection of the elements to be tested and the 
ingenuity he shows in the construction of the individual items testing for
these elements are of far greater importance than the statistical technics 
he may employ. 

Monroe (325) ably criticized the uncritical use of objective tests by 
those who think (a) that complete objectivity in scoring is absolutely 
essential and that any objectively scored test is an accurate measure of
achievement, (b) that, if a test is highly reliable, scores on it are neces- 
sarily accurate measures of achievement as specified by the announced 
function of the test, and (c) that a high correlation with a criterion is 
sufficient evidence to justify the use of the scores yielded by the test as 
highly accurate measures of the achievement considered to be defined by 
the criterion. He pointed out quite clearly the errors in these beliefs and 
urged more intelligent use of objective tests and more critical interpreta- 
tion of the scores obtained from them. 

Tyler (344, 345, 346, 347) laid particular stress upon the necessity
of a careful formulation of the objectives of instruction in a given course
before actual construction of a test is begun and upon the importance 
of building the test so as to measure the degree of attainment of each 
of those objectives. He used a description of the procedures followed in 
setting up such objectives in the Department of Zoology at Ohio State 
University, to illustrate the technic outlined.

Willoughby (353) dealt with the concept of reliability and how to meas-
ure it; he stressed the fact that the essential factor in increasing the
reliability of a test is the addition of highly intercorrelated items rather 
than the mere lengthening of the test. Stephenson (338) questioned the 
value of knowing the reliability of a test as we now measure it. Factor 
saturation, he claimed, rather than the reliability coefficient, is the im- 
portant thing to know concerning a test. 
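Willoughby's contrast between adding intercorrelated items and merely lengthening a test can be illustrated with the general Spearman-Brown relation. The sketch below is not his own derivation; it assumes, for simplicity, items of equal variance with a common average intercorrelation, and the function name `composite_reliability` is introduced only for the example.

```python
def composite_reliability(n_items, mean_item_r):
    """Reliability of a sum of n_items items under the general
    Spearman-Brown relation, assuming equal item variances and an
    average inter-item correlation of mean_item_r."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return n_items * mean_item_r / (1 + (n_items - 1) * mean_item_r)


# Illustrative figures only (not taken from any study reviewed here):
short_weak = composite_reliability(20, 0.10)    # 20 loosely related items
long_weak = composite_reliability(40, 0.10)     # the same kind of item, doubled
short_strong = composite_reliability(20, 0.25)  # 20 closely related items
```

Under these assumptions the twenty closely related items come out more reliable than forty loosely related ones, which is the substance of the point; the particular values depend entirely on the intercorrelations assumed.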

Lindquist and Anderson (315), in a discussion of achievement testing 
in the social studies, attempted “first, to define as specifically as possible 
the particular functions which may best be performed by tests of the 
general achievement type; second, to show how the general achievement 
test in the social studies must be constructed if it is to perform these 
functions most effectively; third, to show why a test of this type cannot 
be made to fulfill other measurement purposes without detracting from 
its validity with reference to those functions which are distinctive to it;
and finally, to present a detailed description of a test of this type and to 
illustrate concretely its more important characteristics.” The authors drew 
attention to important distinctions between validity of content for inclu- 
sion in the course of study, and validity of content for inclusion in a test 
of general achievement, as well as between the concepts of validity as ap- 
plied to diagnostic and general achievement tests. They also provided con- 
crete illustrations of factors which contribute to high and low validity 
in individual items in general achievement tests in the social studies. 

The fact that there may be a difference between the validity of a test 
when used to measure status and its validity when used to measure change, 
the difference between two measures of status, was pointed out by Wat- 
son (350). He recommended that good tests, intended to be used “before 
and after,” should furnish coefficients of reliability and validity for change 
scores as well as for status scores. Worcester (355) suggested the need of 
research to discover just what type or kind of question is actually asked 
in the classroom during recitations and in life outside of the school, so 
that similar questions may be used in tests in order to gain validity. 
Illustrations of items showing the difference between purely factual and 
thought, or interpretation, questions are contained in a discussion by Price 
(327) of tests, objective and otherwise, in the social studies. 
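Watson's distinction between status scores and change scores can be made concrete with the usual formula for the reliability of a difference between two measures. The sketch assumes the classical conditions (errors of measurement uncorrelated between the two testings) and is not taken from Watson's article; the function and parameter names are illustrative.

```python
def change_score_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Reliability of the change score (after minus before).

    r_xx, r_yy: reliabilities of the "before" and "after" status scores.
    r_xy:       correlation between the two status scores.
    sd_x, sd_y: standard deviations of the two status scores.
    Assumes uncorrelated errors of measurement on the two occasions.
    """
    covariance = r_xy * sd_x * sd_y
    true_variance = sd_x ** 2 * r_xx + sd_y ** 2 * r_yy - 2 * covariance
    total_variance = sd_x ** 2 + sd_y ** 2 - 2 * covariance
    if total_variance <= 0:
        raise ValueError("change scores have no variance")
    return true_variance / total_variance
```

With two status scores each of reliability .90 and correlating .80 with one another, the formula gives a change-score reliability of about .50, which shows why coefficients for status cannot be assumed to hold for change.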


Administrative Factors Affecting Test Validity and Reliability 


The fallibility of gross statistical measures of test validity and relia- 
bility when secured under uncontrolled conditions has been demonstrated 
by a number of studies which have attempted to measure the influence of 
certain factors of test administration upon obtained validity and relia- 
bility coefficients. 

1. Rate of administration—Cook (286) and Lindquist and Cook (316) 
have demonstrated that both the validity and the reliability coefficients for a 
given body of test materials are dependent largely upon the time in which
these materials are administered, and that unless this time factor is con-
trolled, studies of the comparative validity and reliability of different test
materials are likely to prove inconclusive. They defined “optimum adminis-
tration time” as the length of time of administration of a test in which
the highest validity is secured per unit of time and proposed an experi-
mental procedure for determining this optimum time empirically. They
investigated the relationship between administration time and validity 
and reliability for six forms of a spelling test, and showed that the nature 
of this relationship varied from one type of test to another. 

Tinker (342) wondered if the “speed attitude,” the feeling of working 
against time, might lower the validity of tests. He gave two forms of the 
Army Alpha test to 221 freshmen and sophomores with different time allow-
ances and found that the scores were almost as high and the tests equally
valid when the students worked against a time limit large enough to allow 
all of them to attempt most of the items as when there was no time limit 
at all. He concluded that it is all right for students to have the speed atti-
tude when taking tests, i.e., to be working under pressure, provided a rea-
sonable time is allowed for the test. Caldwell (2h) found very similar
results with seventh-grade children on the Stanford Achievement Reading 
Tests. He analyzed the results on various intelligence levels and found 
that the students of lower intelligence required more time than the more 
intelligent pupils to reach their maximum scores. 

2. Practice on tests—Anastasi (272) conducted an experiment with col- 
lege students which showed that test reliability increases with practice on 
the test when the test is administered by the time-limit method and relia-
bility is measured by the odd-even technic. She discussed four factors
which sometimes cause variations from the above generalization. Her report 
includes an exhaustive summation of the experimental evidence on the 
problem. 

3. Instruction between test administrations—Copeland (287) explained 
the data found by three other experimenters, in which tests were less reli- 
able when given at the end of a teaching period than when given at the 
beginning of the period, on the basis of decreased range of talent at the 
end of the period due to the instruction received. 
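The dependence of a reliability coefficient on range of talent, which Copeland invoked, follows from the relation commonly attributed to Kelley: if the standard error of measurement is the same in both groups, then the standard deviation times the square root of one minus the reliability is constant. The sketch below simply solves that relation for the unknown coefficient; the function name is illustrative and the constancy of the standard error is an assumption, not a finding of the studies reviewed.

```python
def reliability_in_new_range(r_known, sd_known, sd_new):
    """Estimate reliability in a group of different variability.

    Assumes the standard error of measurement, sd * sqrt(1 - r), is the
    same in both groups, so that
        sd_known**2 * (1 - r_known) == sd_new**2 * (1 - r_new).
    """
    if sd_new <= 0 or sd_known <= 0:
        raise ValueError("standard deviations must be positive")
    return 1 - (sd_known ** 2 / sd_new ** 2) * (1 - r_known)
```

For instance, a test with reliability .90 in a group whose standard deviation is 10 would be expected to show a reliability of about .72 in a group whose standard deviation has shrunk to 6, with no change in the test itself.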

4. Directions on true-false tests—Weidemann and Newens (351) per- 
formed an experiment to find the effect of provision of different sets of test 
directions for true-false and indeterminate statement tests upon the time
required for the administration of the test, upon the reliability of the test, 
and upon the central tendencies and variabilities of the distributions of 
test scores. Few significant differences in performance were found between 
tests with various sets of directions. The final conclusion of the experi- 
menters was that one should “use a set of directions for true-false and in- 
determinate statement examinations whose definitions for response on the 
decision scale correspond to the nature of the instruction for each item 
of the course.”

5. Arrangement of items according to difficulty—Using 453 pupils in the
Minneapolis fifth and eighth grades, Capron (283) found no significant 
differences in performance due to the arrangement of items in objective 
tests in spelling, arithmetic problems, and the fundamental arithmetical 
processes. Easy-to-hard, hard-to-easy, and random arrangements were used. 


Measurement of Validity and Reliability 


A comparison of the advantages and limitations (due to assumptions in- 
volved in their use) of the tetrad technic, the split-halves technic using the 
Spearman-Brown prophecy formula, and simple correlation of two forms 
of a test as measures of reliability was made by Dunlap (294). He con- 
cluded, among other things, that “the Spearman-Brown formula will give 
a very close approximation to the reliability of the total form, as split 
halves will in general be approximately equally reliable.” Brownell (279) 
would be skeptical about this statement, especially if applied to compara- 
tively short, non-standardized tests. He obtained variations from about 
.30 to .55 in the reliability coefficients of single tests measured by the 
split-halves method and the Spearman-Brown prophecy formula when the 
test items were arranged in different orders. Such evidence is taken to indi- 
cate the necessity of considering carefully the assumptions underlying this 
technic when using it to measure reliability. Brownell pointed out that 
most texts on statistics say very little about the conditions under which 
the formula may properly be used, and consequently much misuse results. 
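
The split-halves procedure discussed above can be sketched in modern terms. 
The following Python fragment is an illustration only and is not taken from 
any of the studies cited; the function names and the odd-even split are 
assumptions made for the example. It steps up the correlation between two 
half-test scores by the Spearman-Brown prophecy formula, 
r_kk = k r / (1 + (k - 1) r), with k = 2 for a test twice as long as each half. 

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, k=2.0):
    """Predicted reliability of a test made k times as long."""
    return k * r / (1.0 + (k - 1.0) * r)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, stepped up to full length.

    item_scores[p][i] is 1 if pupil p passed item i, otherwise 0.
    """
    first = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(first, second))
```

Because the two halves depend on where each item falls in the test, 
rearranging the columns of item_scores before calling split_half_reliability 
can change the coefficient obtained, which is the difficulty Brownell reported. 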

Handy and Lentz (299) developed a formula to express the reliability 
of a test in terms of the discriminating value of the individual items, these 
values, in turn, being computed by comparisons of the responses to each 
item by the subjects who ranked in the upper 30 percent and in the lower 
30 percent on the entire test. Dickey (293) worked out a formula for 
estimating the reliability coefficient of a test for one range when its relia- 
bility coefficient is known for a given range. Because the standard error 
of estimate of his formula is smaller than that of the similar formula 
developed by Kelley for the same purpose, Dickey claimed it is the better 
of the two. Cureton (291) discussed a method and appropriate formulas 
for finding the reliability of a fallible criterion, such as college success 
or a series of judgments, against which the validity of a battery of tests 
is to be computed. 
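
The dependence of a reliability coefficient on the range of talent can be 
illustrated with the classical relation usually attributed to Kelley. The 
sketch below assumes that the standard error of measurement is the same in 
both groups, so that sd_known * sqrt(1 - r_known) = sd_new * sqrt(1 - r_new). 
Dickey’s own formula is not reproduced here, and the function name is 
hypothetical. 

```python
def reliability_for_new_range(r_known, sd_known, sd_new):
    """Estimate reliability in a group whose score SD is sd_new.

    Assumes an equal standard error of measurement in both groups.
    """
    if sd_known <= 0 or sd_new <= 0:
        raise ValueError("standard deviations must be positive")
    return 1.0 - (sd_known ** 2 / sd_new ** 2) * (1.0 - r_known)


# A test with reliability .90 where the SD is 10 points, applied to a
# more homogeneous group whose SD is only 5 points.
print(round(reliability_for_new_range(0.90, 10.0, 5.0), 2))  # 0.6
```

Under this assumption, halving the spread of scores reduces a coefficient of 
.90 to .60, which shows why a reliability coefficient should be reported 
together with the variability of the group on which it was obtained. 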


Consistency of Responses 


Dewey (292) gave a reading comprehension test composed of several 
different types of objective items, with several items (of different types) 
testing each idea, to fifty-five eighth-grade students. The consistency with 
which answers were made, either right or wrong, ranged from less than 
50 percent among the least intelligent students to less than 67 percent 
among the most intelligent. The conclusion was that a response to a single 
item or even a single type of item is a questionable measure of a pupil’s 
understanding of an idea. Similar findings were obtained by Brueckner 
and Elwell (281) in tests on the multiplication of fractions. They recom- 
mended including three or four examples of each kind in a diagnostic 
classroom test in arithmetic because of the many errors due to factors 
other than the understanding of the correct procedure required to solve 
the problem. Brueckner and Hawkinson (280) followed up this study by 
experimenting to see whether or not placing all four examples of one type 
together in a row (which procedure would make scoring and analysis much 
easier than if the related items were scattered throughout the test) would 
affect the number or kind of errors made. It did not, and so such grouping 
of similar items is advocated. 


Comparative Validity and Reliability of Specific Test Forms 


The literature on testing during recent years has contained a decreasing 
proportion of studies concerned with experimental determination of com- 
parative validities and reliabilities of various types of objective tests. 
Most of the studies of this character which have appeared, including a 
number of those which are reported below, are subject to certain general 
criticisms which may in part account for the decreasing activity in this 
field of research. These general criticisms are as follows: 

First, the studies of comparative validities and reliabilities thus far re- 
ported have in most cases attempted to determine the relative effectiveness 
of the various technics in general rather than in relation to specific fields 
of subjectmatter or in relation to any specific objectives of any given field. 
Whatever advantages or superiorities any type of test may have, however, 
are specific advantages in specific situations. Generalized conclusions not 
only are of little or no positive value but may even be definitely misleading, 
since they may result in a general condemnation of types of tests which 
in specific instances or for restricted purposes might be highly valuable. 
In order to be of value to the test constructor, comparative studies must 
determine the relative effectiveness of various technics for highly specific 
purposes. Few studies of this type have yet been made. Because of their 
highly generalized nature, the majority of studies thus far reported con- 
tain little or nothing of value to the test technician in the solution of speci- 
fic problems. 

Second, the majority of reported studies of the relative effectiveness of 
various testing technics have based their conclusions concerning the “rela- 
tive effectiveness” of the technics investigated primarily upon determina- 
tions of their comparative reliability or self-consistency. Objective and de- 
pendable descriptions or measures of validity are often extremely difficult 
to secure because of the lack of any acceptable independent criterion of 
validity. Reliability coefficients, on the other hand, can usually be easily 
determined. For this reason, there has been a tendency to give undue promi- 
nence to the concept of reliability. The reliability or self-consistency of 
any test or testing technic, however, is of very minor significance in com- 
parison to the validity of that technic in relation to the specific purpose 
for which it is intended. Considering the fact that the reliabilities of a 
number of tests intended for the same purpose may, in some instances, 
even be negatively related to their validities, experimental comparisons 
concerned primarily or exclusively with the reliability characteristic are 
likely to contain little of value to the test builder. 

Finally, the majority of comparisons between various technics in test- 
ing that have been reported have failed to control certain important fac- 
tors which of themselves could readily account for the differences which 
have been found. Among the most important of these factors is the skill 
or ingenuity of the test builder. Investigator A, for example, who is inter- 
ested in measuring the amount of scientific information acquired by high- 
school pupils in general science, builds a true-false test and a matching 
test over the same items of information. In this situation he might show 
that his true-false test is more reliable and valid than his matching exer- 
cises. This may only prove, however, that A is more ingenious in the 
construction of true-false tests for this specific purpose than he is in the 
construction of matching exercises, and may show nothing at all concerning 
the relative effectiveness with which these technics may be employed by 
other test constructors in the same or in another situation. The validity 
of any test in relation to a given purpose is far more a function of the 
skill or ingenuity evidenced in the application of the technic used than 
it is of the type of exercise employed. Where this factor of skill or in- 
genuity in test construction is left uncontrolled, comparisons of obtained 
measures of validity or reliability of two types of tests may only show 
the degree to which the test builder has realized or approached the ultimate 
possibilities of each type of test in the situation involved, and may not 
indicate at all how they would have compared had their respective pos- 
sibilities been fully realized in that situation. There are few studies in which 
this factor has been even recognized and practically none in which it has 
been adequately controlled. Considering, furthermore, the extreme diffi- 
culty of controlling this factor in the experimental situation, it seems un- 
likely that empirical studies will contribute very much to the better evalua- 
tion of the various types of objective test exercises. An exception to this 
generalization may be noted in the case of those few fields where it is 
possible to provide a detailed and objective description of the manner in 
which the individual items are constructed, i.e., those fields in which the 
ingenuity of the test constructor plays a relatively minor role or can be 
held constant, as in tests of spelling, of the basic arithmetical skills, and 
of the mechanics of correct writing. Another factor frequently left uncon- 
trolled in experimental comparisons of the type here considered is the 
factor of administration time, which has been discussed briefly in an earlier 
reference. The foregoing considerations should be kept in mind by any 
reader who has occasion to refer to the articles briefly summarized in the 
following paragraphs. 

Cook (286), in a carefully controlled experiment, determined the rela- 
tive validities of six forms of a spelling test when each was administered 
at its optimum time. In general, the recall types were found to be superior 
to recognition tests and the latter to be improved in validity by corrections 
for guessing. 

A five-response multiple-choice test of word meanings, a “same-opposite- 
neither” test, a matching test, and a multiple-choice test in which the 
stimulus word was given in a sentence were compared in validity by 
V. H. Kelley (310) by correlating them with a criterion test in which the 
subjects were asked to use the words in a sentence. The optimum testing 
time for each test was determined by experimental procedures and each 
test given in the time which gave it maximum validity per unit of time. 
The tests were compared in difficulty and in validity as measured by the 
percent of agreement between each and the criterion test on individual 
items. The same-opposite-neither test and the multiple-choice sentence test 
were lower in validity, though there was not a significant difference between 
them and the others. The matching and multiple-choice tests were most 
valid, and the latter most difficult. 

Sims (332) experimented with 5-, 10-, and 15-item rearrangement tests 
on the college level, each item consisting of 5, 10, or 15 headings taken 
from a psychology text which were to be arranged in logical sequence, 
and found that they compared favorably in reliability and correlated 
highly with other measures of achievement in psychology and of intelli- 
gence. The 15-item type took longer to write and considerably longer to 
score but was more reliable than the other two; it was recommended in 
cases where only a limited number of items was available. Weller and 
Broom (352) studied the validity of six types of spelling tests. They 
found that the proof-reading type of test measured something different from 
what is measured by the recall type of test. The sentence-dictation test was 
considered most valid. Correlations between types were low; within types, 
high. 

Because he found that individuals were often inconsistent in their answers 
to items similar in content but different in form, Magill (320) maintained 
that the burden of proof rests upon those who hold that true-false, multiple- 
choice, and recall tests are of approximately equal validity and therefore 
use them indiscriminately. He gave tests of these three types covering the 
same information to two groups of teachers in service; intercorrelations 
of resulting scores were high, but individual responses were very incon- 
sistent. Corrections for guessing did not change his results materially. 
Hurd (308) compared a short-answer with a multiple-choice type of 
science test in validity, reliability, and difficulty. The tests covered identi- 
cal subject content. The reliability coefficients computed by the split- 
halves method were .93 for the short-answer test and .82 for the multiple- 
choice test. The correlation between them was .78. The latter test was the 
easier of the two. The short-answer type was said to be more valid because 
larger gains were made on it between the first administration of the test 
(before instruction) and the second administration (after instruction) than 
were made on the multiple-choice test. (This seems to be a doubtful crite- 
rion of validity, especially in view of the difference between the tests in 
difficulty of items.) 

Innovations in test form—The true-false test has been most often 
“improved.” Briggs and Armacost (276) found that oral true-false tests 
for immediate recall compared very favorably with such tests presented 
in visual form. The reliability of a 50-item test which they gave to a 
college class in junior high-school procedures was about .69; stepped up 
to 100 items, it would be .81. Its obvious advantage is the saving of time 
and expense involved in mimeographing copies of the test. 
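
The stepped-up figure presumably comes from the Spearman-Brown prophecy 
formula with the test doubled in length (k = 2). As a check on the 
arithmetic, taking the reported coefficient at its rounded value of .69: 

```latex
r_{kk} = \frac{k\,r}{1 + (k-1)\,r}
       = \frac{2(.69)}{1 + .69}
       = \frac{1.38}{1.69}
       \approx .82
```

The small difference from the .81 quoted is probably due to rounding in the 
reported coefficient; a 50-item coefficient of .68 would give .81. 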

An attempt has been made to improve the true-false test by requiring 
the student not only to recognize a false statement when he sees it but also 
to know enough about why it is false to be able to correct it. Horn (304) 
described what she called the “variable answer” test in which the students 
were to correct the false items; scoring was somewhat subjective. Andruss 
(273) required his students to correct the false items by crossing out just 
one word and putting another in its place. Then he built his test so that 
there could be only one correct one-word correction. It made test-building 
rather difficult but secured objective scoring and combined the advantages 
of the true-false, multiple-choice, and completion tests. The test was 
scored by awarding one point for correctly indicating that an item was 
false, one for crossing out the right word, and one for putting the correct 
word in its place. Hevner (301) increased the reliability of a true-false 
test by allowing the pupil to indicate his confidence in the correctness of 
his answer and weighting each response according to the degree of confi- 
dence shown in it. 

The multiple-choice form of test has also been shown to be usable 
when presented orally. Sims and Knox (333) checked 3-, 4-, and 5-response 
orally presented tests against a 5-response visually presented test as a 
criterion, and found them to be a little more difficult and a little less 
reliable than the visual test but not so much so as to invalidate their use. 
The 5-response oral test was superior to the other two. Scheidemann (331) 
suggested asking the pupils to indicate all correct responses to a multiple- 
choice item rather than the one correct response. Frutchey (297) explained 
a practical modification of the multiple-choice type of test for the measure- 
ment of ability to apply chemical principles. 

Some investigators have reported experiments with test forms widely 
different from the commoner types. Desiring to measure application of 
information, Smeltzer (334) compiled tests in which each item consisted 
of a description of a practical situation and several suggestions as to what 
should be done by the student in such a situation. Students rated the answers 
on a scale of one to five. The keys were based on the judgment of several 
professors, and errors on the part of those taking the test were measured in 
terms of their numerical deviations from the key on each response. Brooks 
(277) described a test used in his history of education and introduction to 
education classes in which he lists 100 terms and asks the pupils to under- 
line all the terms they can define and then to go back and actually define 
every fifth one. The percent of correct definitions multiplied by the 
number underlined gives the score. The procedure saves time; of course, 
the scoring is subjective to the extent that the terms listed may be variously 
defined. 

Miscellaneous considerations—Lamson (313) raised the question of 
what happens when a student changes his answer on a true-false test. 
Her conclusion, based upon 144,000 items from 1,511 papers of college 
students, was that “it is better to record a second judgment in a true-false 
examination than not to record it. The chances are two to one that the 
second judgment will be the correct judgment. It is much safer to change 
a judgment from true to false than vice versa.” 

Stalnaker and Stalnaker (337) found that selected distractors (words 
commonly confused with the word in question) in a 5-choice best-answer 
vocabulary test were marked more often than chance distractors (those 
selected at random). There was no interference with discrimination, so 
their use is recommended. 


Item Validity and Reliability 


Methods of determining item validity—The importance of measuring the 
validity, or effectiveness, of a single test item as an aid in the construction 
of more valid tests has been recognized by investigators, and a variety of 
methods for computing an index which will give an objective description 
of the worth of an item have been suggested and used. One of the principal 
limitations of most of these indexes is that they are concerned only with 
the correlations between a single criterion and the responses to individual 
items, and do not take into consideration the intercorrelations between the 
item responses. Theoretically, the best test is that in which the individual 
items correlate highly with the criterion but show relatively low inter- 
correlations. An item with a high index may, therefore, prove less valuable 
when used together with other highly related items than another item which 
has only a medium index but which shows low correlations with other 
items in the test. Another practical limitation of the indexes prepared is 
that their worth depends upon the validity of the criterion employed, and 
that reliable independent criteria are not often available. When the crite- 
rion employed is the total score on the test itself, the index for an individual 
item is strictly a measure of the extent to which the item contributes to 
the reliability rather than to the validity of the whole test. For these reasons, 
extreme caution must be exercised in the interpretation of results with 
most of the procedures suggested in the articles reviewed in the following 
paragraphs. 

Lindquist and Cook (316) set up five criteria for an index of discrimi- 
nation and compared five different indexes with reference to those criteria. 
They found the bi-serial r index of discrimination to be the best indication 
of the value of an item in contributing toward the ranking of students 
according to general achievement. The calculation of this index requires a 
great deal of computation, however, and is therefore rather impractical in 
situations where machines are not available for the purpose. A simpler 
index which may be calculated without the use of machines and which 
gives quite satisfactory results was suggested as an alternate procedure. 
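
With a computer the bi-serial r index is no longer laborious to obtain. A 
minimal sketch follows, assuming the usual textbook formula 
r_bis = ((M_p - M_q) / s_t)(pq / y), in which M_p and M_q are the mean 
criterion scores of pupils passing and failing the item, s_t is the standard 
deviation of all criterion scores, p and q are the proportions passing and 
failing, and y is the ordinate of the unit normal curve at the point dividing 
p from q. The function name and the data are invented for illustration and 
are not those of Lindquist and Cook. 

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(item, criterion):
    """Biserial correlation between a pass/fail item and a criterion score.

    item[i] is 1 if pupil i passed the item, otherwise 0;
    criterion[i] is the same pupil's score on the criterion.
    """
    passed = [c for i, c in zip(item, criterion) if i == 1]
    failed = [c for i, c in zip(item, criterion) if i == 0]
    if not passed or not failed:
        raise ValueError("item must be passed by some pupils and failed by others")
    p = len(passed) / len(item)
    q = 1.0 - p
    # Ordinate of the unit normal curve at the point cutting off proportion p.
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(passed) - mean(failed)) / pstdev(criterion) * (p * q / y)


# Hypothetical responses of eight pupils, listed from highest to lowest
# criterion score.
item = [1, 0, 1, 1, 0, 1, 0, 0]
criterion = [8, 7, 6, 5, 4, 3, 2, 1]
print(round(biserial_r(item, criterion), 2))
```

The index rests on the assumption that the trait underlying the item is 
normally distributed; with small or markedly non-normal groups it can exceed 
1.00, so such values should be read as a warning about the data rather than 
as evidence of a perfect item. 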

Richardson and Stalnaker (329) suggested and gave the derivation of 
a formula for the computation of a bi-serial coefficient of correlation 
which they claimed was based on more suitable assumptions than the one 
usually used in analyzing test items. Zubin (357) reviewed three methods 
of internal validation of test items: the critical ratio, bi-serial r, and 
association methods. Formulas were worked out to correct for errors due 
to the customary practice of including the item under analysis in the total 
score when calculating these indexes. Zubin rated the three methods 
roughly on the basis of ease of application, limitations, and underlying 
assumptions. 
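
The spurious element that Zubin corrected by formula arises because the item 
contributes to the total with which it is correlated. Where the original 
responses are at hand, the same difficulty can be avoided directly by 
correlating each item with the score on the remaining items. The sketch 
below does this; it is a substitute for, not a reproduction of, Zubin’s 
correction formulas, and the names used are hypothetical. 

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_r(item_scores, index):
    """Correlation of one item with the total score, item included."""
    item = [row[index] for row in item_scores]
    total = [sum(row) for row in item_scores]
    return pearson(item, total)


def item_rest_r(item_scores, index):
    """Correlation of one item with the score on all the other items."""
    item = [row[index] for row in item_scores]
    rest = [sum(row) - row[index] for row in item_scores]
    return pearson(item, rest)
```

On short tests the difference between the two coefficients can be 
considerable, since a single item then makes up a large share of the total. 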

Long (317) devised an improved overlapping method of measuring 
item validity which correlated very highly (r of .94) with the bi-serial 
index on a 110-item general science test. The method depends upon no 
statistical assumptions, is easy to understand and to compute, is not 
affected by the difficulty of the item, and has been found to be slightly 
superior to the bi-serial r index of discrimination in selecting items which 
correlated highly with the criterion used. Long also developed a further 
“weighted overlapping formula” which has even higher selective value; 
it is weighted according to the number of “differentiations” an item makes. 

Statistical theory and formulas for the evaluation of test items by the 
method of successive residuals, which takes into consideration discrimina- 
tion plus intercorrelation of items so as to avoid having the items all test 
exactly the same thing, were developed by Horst (307). He also illustrated 
a work sheet for use in carrying out the mechanical operations involved in 
evaluating items by this method. 

Votaw (349) worked out a method of correcting for guessing when 
measuring the validity of test items so that comparisons between perform- 
ances of good and poor students would be measured by the proportions of 
the two groups who really knew the answer rather than by the proportions 
who answered it correctly. The result of the use of this plan in evaluating 
items was the retention of more items which were selective at the low end of 
the distribution so that better discrimination among the poorer students 
was obtained. A scheme was devised for detecting items which had “tricked” 
good students into making incorrect responses. 
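
The idea of comparing the proportions who really knew the answer can be 
illustrated with the conventional chance correction, in which the proportion 
knowing an item of k choices is estimated as (p - 1/k) / (1 - 1/k), where p 
is the proportion answering correctly. This is offered only as a sketch, 
under the assumption that pupils who do not know the answer guess at random 
among the k choices; Votaw’s own procedure may differ in detail, and the 
function names are hypothetical. 

```python
def proportion_knowing(p_correct, n_choices):
    """Estimated proportion who knew the answer, after removing lucky guesses."""
    chance = 1.0 / n_choices
    # Proportions at or below the chance level are treated as zero knowledge.
    return max(0.0, (p_correct - chance) / (1.0 - chance))


def corrected_discrimination(p_upper, p_lower, n_choices):
    """Difference between good and poor groups in the proportion knowing."""
    return (proportion_knowing(p_upper, n_choices)
            - proportion_knowing(p_lower, n_choices))


# A four-choice item answered correctly by 85 percent of the good students
# and 40 percent of the poor students: the raw difference is .45, the
# corrected difference about .60.
print(round(corrected_discrimination(0.85, 0.40, 4), 2))  # 0.6
```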

Votaw (348) explained a technic for working out a graph with which 
to determine the selectivity of test items. Arnold (274) presented a 
variation of this in chart form. 

Application of data on item validity—The determination of an index 
is merely a means to an end. The question of how to make use of it in test 
construction is an important one upon which there is little experimental 
evidence. Smith (335) gave a 200-item vocabulary test to 370 bright 
eighth-grade children, computed the bi-serial r indexes of discrimination 
of the items, and ranked the items according to these indexes. Then he 
made four subtests of 20 items each (using the 20 highest ranking items 
in one subtest, the 20 lowest ranking in another, and two intermediate sets 
of 20 items, one above the average and one below, for the other two sub- 
tests) and used the remaining 120 items as a criterion test. All 200 items 
were given to two more groups of 500 pupils each. The entire test was 
more valid when the subtest of lowest ranking items was omitted, but the 
omission of any of the other subtests lowered the validity of the test as a 
whole; therefore, Smith concluded that in improving a test the “worst” 
items should be discarded but all items with indexes above .40 should be 
retained even though there are many items with much higher indexes. In 
other words, although these items are of medium low validity as measured 
by the index used, they measure something which needs to be measured 
and which the items with higher indexes evidently fail to measure, and so 
they should not be discarded. 

Lindquist and Anderson (315) discussed the factors which affect the 
discriminating power of items and gave specific illustrations of items 
which proved ineffective in actual practice, pointing out the reasons for 
their failure to discriminate well. 

Relation between item difficulty and validity—Henry (300) found no 
significant relation between the difficulty of an item and its validity as 
measured by the bi-serial r method, the Clark method, the Vincent method, 
the upper and lower thirds method, and a combination of them all. Horst 
(305) developed a formula to express the difficulty of a multiple-choice 
test item as a deviation from the mean of the group in standard deviation 
units. He deduced the idea that multiple-choice items with fewer alterna- 
tive responses which are equal in difficulty are more valid than those 
with a larger number of choices presenting a wide range of difficulty. 

Reliability of test items—Holzinger (302) worked out formulas giving 
the standard error of response or measurement of a single test item. He 
stated: 

The response error of series of items is already known. The problem, then, is to 
find this error for a single item and show the relation to the standard error of a series 
of such items. It is hoped that these new formulae will be of some value in building 
up tests of items of known reliability and in predicting the final reliability of a 
number of such items when combined. Conversely, it is believed the formulae may 
be useful in appraising the reliability of tests of different lengths by reducing the 
measure of reliability to the average item basis. 


Collection of data for test validation—Horst (306) described a proce- 
dure for building a test of personality and mental alertness in such a way 
as to try out a large number of items with a minimum of test administra- 
tion. He divided his try-out group into two groups, gave each a set of items 
of two types, and then predicted what the score of each individual would 
have been had he taken a test on all the items of one type. In this way 
twice as many items may be tested as is usually done. The method and 
formulas may be applied to any type of test. 


Scoring 


1. Errors in scoring—Rauth (328) compiled data showing that errors 
are frequently made in scoring objective tests, and listed the most common 
types of errors made in scoring various test forms. 

2. Methods of scoring common forms of objective tests—The use of 
mechanical devices and scoring forms in an effort to save time and money 
and to be completely objective represents the trend in scoring procedure. 
Bawden (275), Cuff (290), Fay and Middleton (296), and Pressey (326) 
have reported the use of scoring forms, separate sheets of paper ruled in 
various ways, upon which the pupils indicate their answers. This plan 
permits the same test papers to be used again and again. Such score forms 
may be used with any of the more common types of objective tests; 
ingenious suggestions for scoring them are many. Cuff (290) ran the 
answer sheets through a mimeograph so arranged as to encircle the correct 
responses and make speedy tabulation of the results possible. Fay and 
Middleton (296) and Pressey (326) recommended a cardboard stencil 
key. Pressey also described a machine procedure which secured speedy 
and accurate results—a part of what he called “the coming ‘Industrial 
Revolution’ in education.” 

Cuff (289) told of a mechanical device for scoring multiple-choice test 
items by weight, which is from 10 to 40 times faster than scoring by hand 
and very much more accurate—practically perfect. Other forms of objec- 
tive tests undoubtedly could be scored similarly. 

3. Methods of scoring less common forms of tests—Elderton (295) pre- 
sented objective methods for the scoring of maps drawn by students and 
for the combining of time intervals and errors into a single score on tests 
of mental imagery. 

Brown, Bartelme, and Cox (278) explained and advocated the use of a 
procedure for scoring tests in which items have been so scaled that the 
scale values of the items missed and passed can be taken into consideration. 
The score of the individual is taken as that point on the scale where the 
average deviation of the correctly marked items above it equals the average 
deviation of the incorrectly marked items below it. Examples are given to 
indicate the superiority of this method over the one commonly used by 
Thurstone and others, which involves the scale positions rather than the 
scale values of the items. 

T. L. Kelley (309) developed a formula to determine the weight to 
assign to a given response in a test designed to compare a person’s interests 
with those of some homogeneous group or class of people. 

A special 68-item examination devised to measure the ability to evalu- 
ate two types of objective test items was given to a University of Arkansas 
class in construction and evaluation of objective test items by Gerberich 
(298). The examination was scored by four methods. The optimum method, 
as determined by its reliability and correlation with a final examination 
and term grades, involved a weighting of one point for each “good” item 
so designated and two points for each “poor” item so designated. 

4. Corrections for guessing—Melbo (321) analyzed the results of a 
50-item true-false test given to 1,480 high-school and college students and 
decided that the right-minus-wrong formula to correct for guessing was 
fallacious—that it was better to consider the number of correct responses 
as the score. Such results would have been expected in view of a study by 
Krueger (312), who made an empirical check on the laws of chance and 
showed that the right-minus-wrong formula provides a fairly valid correc- 
tion for guessing with tests of several hundred items but that in tests of 
less than 100 items many spuriously high and low scores will be obtained 
in spite of the use of a correction formula. 
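
The finding on test length agrees with what the binomial law predicts for a 
pupil who guesses blindly on every item. The sketch below, which is an 
illustration and not a reconstruction of either study, applies the 
right-minus-wrong formula in its general form, S = R - W/(k - 1), and gives 
the standard deviation of the corrected score under pure guessing. The 
function names are assumptions. 

```python
from math import sqrt


def corrected_score(right, wrong, n_choices=2):
    """Right-minus-wrong score; omitted items count neither way."""
    return right - wrong / (n_choices - 1)


def chance_sd(n_items, n_choices=2):
    """SD of the corrected score when every item is answered by pure guessing."""
    p = 1.0 / n_choices
    # With R ~ Binomial(n, p) and W = n - R, the corrected score equals
    # (k * R - n) / (k - 1), so its SD is k / (k - 1) times the SD of R.
    return n_choices / (n_choices - 1) * sqrt(n_items * p * (1.0 - p))


for n in (50, 100, 400):
    # Spread of the chance score as a fraction of the length of the test.
    print(n, round(chance_sd(n) / n, 3))
```

For a true-false test the spread due to chance alone is about 14 percent of 
the test length at 50 items, 10 percent at 100 items, and 5 percent at 400 
items. The correction removes the average gain from guessing but not this 
variability, which is why short tests still yield spuriously high and low 
scores. 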

Zubin (356) developed formulas to correct for guessing in determining 
the mean and the standard deviation of a matching test. They are quite 
simple and easy to use. They may be applied to individual scores provided 
the number of questions is sufficiently large. 


The Influence of Objective Examinations 


1. Upon instruction—The March, 1935, issue of the Journal of Educa- 
tional Research (354) contains the opinions of a number of leading 
testing authorities concerning the effects of measurement upon instruction. 
The general opinion seems to be that the influence of measurement upon 
instruction may be either beneficial or harmful depending primarily 
upon the nature and quality of the measuring instruments used. 

2. Upon learning—The so-called negative suggestion effect of true-false 
items has been a subject of much research. Keys (311) found slightly 
harmful suggestion effects from false items in an educational psychology 
test. McClusky (319) showed clearly that college students miss false items 
more often than true ones and compiled data from which he concluded 
that there were detrimental negative suggestion effects from true-false 
tests, though he recognized that, in any case, they were temporary in nature. 
Sproule (336) carried on experiments at the fifth-, seventh-, and ninth- 
grade levels and concluded that “the present evidence does not justify a 
condemnation of the true-false test on the basis of false impressions that 
it produces.” He found that allowing students to correct their true-false 
tests offset practically all negative effects and contributed to positive 
learning, so he decided that it was safe to use such tests as low as the 
fifth-grade level if students were allowed to correct their papers. Similar 
decisions were made by Ross and Pirie (330) after experimenting with 
college students. They found that true-false tests did not inculcate false 
impressions when the correct answers were given to the students after the 
test. Thus the evidence would seem to indicate that the true-false test, when 
properly used, is a valuable instrument for instructional purposes. 

Meyer (322) conducted an experiment which showed that when pupils 
study for a recall test they retain the knowledge longer than when they 
study for a recognition test, so that it is best not to let them know ahead 
of time that they will be given a recognition type of test. McClusky (318) 
illustrated the value of correcting tests in class and of discussion at that 
time as an aid to maintenance of achievement. 

3. Upon study habits—Studies by Meyer (323), Class (285), and 
Terry (339) have shown clearly that students study differently in prepa- 
ration for different types of examinations, a fact which should be known 
and considered carefully by instructors when deciding upon their testing 
procedures. Students study for details when preparing for objective exami- 
nations; they study the larger aspects of a subject—to get a general 
picture of the material—when an essay test is announced. Meyer (323) 
found that in studying history they draw maps and make summaries 
more when studying for an essay test and do less underlining, and 
that individuals studying for completion tests study harder and make out 
sample test questions more often than those studying for recognition tests. 
All of these studies have been made with college students; the generaliza- 
tions may not apply to pupils at lower grade levels. Crawford (288) has 
compiled a list of suggestions on how to study for objective tests in general 
and for true-false, completion, and matching tests in particular. 


Bibliographies on Test Construction 


Lee and Symonds (314) have written an excellent summary of investi- 
gations concerning objective tests reported between October, 1931, and 
October, 1933; they included a bibliography of 104 references. Holzinger 
and Swineford (303) have compiled three annual annotated bibliographies 
of selected references on the theory of test construction. 


General Studies 


The Committee on Educational Research of the University of Minnesota 
(324) published an account of the experience of that institution with the 
construction and use of the examinations employed in the new General 
College and in other departments of the university. The chapters by Alvin 
C. Eurich, Edgar B. Wesley and Renata R. Wasson, Henry Kronenberg 
and Edgar B. Wesley, Alvin C. Eurich and Francis S. Appel, and Clara 
M. Brown particularly contain good suggestions for the construction of 
objective test items. The appendix contains an interesting collection of 
sample test items from the Minnesota examinations, which should provide 
valuable suggestions to the test builder. 

A similar and more exhaustive and varied collection of sample test 
items was prepared by the Board of Examinations (284) of the Univer- 
sity of Chicago. The latter collection is prefaced by a brief introduction 
containing suggestions for the construction of objective test materials. 


CHAPTER IV 


Recent Developments in the Written 
Essay Examination 


From 1923 to 1935, research studies on the essay test may be grouped 
as follows: 


1. The reliability and validity of essay marking, including comparisons with 
objective tests 


2. The measurement of the same or different mental functions tested by essay and 
objective tests 


3. Student preparation for and reaction to essay and objective testing programs. 


Reliability and Validity 


Wood (389) pointed out that the new type (in this case the true-false) 
is more reliable than the old type essay examination, and 
further stated that the essay type “is most apt to measure: cogency of 
expression, organizing acumen, and reasoning ability.” In a second report, 
he emphasized that “the essay examination is best adapted to the measurement 
of critical capacity and reasoning ability. . . . The best essay examination 
is the one which allows the student to choose two questions out of 
five. . . . The essay law examination is indispensable . . . and . . . it 
can be improved.” 

Sims (381), using “the distinctly good questions” collected by Monroe 
and reported by Odell, found that 34.5 percent were simple recall, 35 percent 
were short answer, and that only 30.5 percent were discussion questions. 
Sims stated that the first two are definitely objective in nature and 
should be built according to established principles of objective test con- 
struction, while the third type is more subjective. He suggested that a 
more satisfactory method of marking the subjective type should be developed. 
In another paper he (384) indicated how to reduce by one-half the 
range and probable error of the marks given by different graders. He 
considered the scores given on essay examinations as raw scores and con- 
verted them into the particular grading system of the school by setting the 
passing point at 1.5 standard deviation of the raw score. Using this proce- 
dure, he (383) compared two forms of an essay test of ten questions each 
with two forms of an objective test each consisting of 34 completion and 
40 true-false questions which were given to 80 students in general psychology. 
The average coefficient of objectivity of rating the essays was .77, 
their reliability was .72, and the correlation between the essay and objective 
tests was .70. 

Weaver and Traxler (387) based their study upon 5 objective and 5 
essay tests made upon 2 units of history and given to 38 pupils. They 
found correlations from .30 to .60 between the tests and a combined 4-essay 
test as a criterion and again with a combined 4-objective test as a criterion. 
Separate test pair reliabilities were not computed. The essay type question 
used by Weaver and Traxler would be classified by Sims as calling for an 
objective short answer. 

The viewpoint of Sims must also be borne in mind when considering 
Leighton’s suggestion (371) for reducing the sampling error of the essay 
test by the addition of a large number of short answer questions. He found 
that when factors to be measured are carefully determined before con- 
structing the essay examination, when the questions are planned and clearly 
stated, and when criteria for judgments are set up, there is great increase 
in the agreement of scorers’ grades. With these criteria, two graders using 
the subjective method of grading papers as a unit were able to correlate 
.63 when grading very subjective material like that of philosophy. They 
used a 1, 2, 3, 4, 5, fail, ranking of papers. Graders of biology, when using 
the more objective method of grading each question separately, corre- 
lated between .63 and .90 in their judgment of the papers. 

Tharp (386) compared old and new type foreign language grammar 
tests with a third semester test consisting of both essay and objective ques- 
tions. He found a correlation of .90 for the objective and .83 for the essay 
test when compared with the grades given in the course as determined by 
the final examination. He suggested that the objective test will measure 
more accurately than the essay test alone. 

Eells (366) had 61 teachers grade the same essay test material at the 
beginning and at the end of an eleven-week interval. He found the variability 
of judgment within the same individual to be as great as the variability 
between different individuals. Correlations ranged from .25 to .51. 
The material used consisted of the geography questions from Ruch’s experi- 
ment and the history material from a previous experiment of Paulu. 

Over a period of two semesters, Peters and Martz (379) made an elabo- 
rate study of short true-false, multiple-choice, completion, and essay tests 
in elementary and secondary subjects and correlated them with the final 
grade given in grades 2 to 12 inclusive, involving 252 students. They 
concluded from the 196 correlations computed that completion and essay 
tests do not vary greatly in validity where the criterion of final grade is 
used. The true-false test was slightly less valid than the completion and 
essay tests, especially in the elementary grades. The multiple-choice and 
essay-discussion tests were equally valid in the elementary school, but in 
high school the essay was more valid than the multiple-choice test. The 
criterion of final grade with which each of the ten-minute experimental 
tests was compared was determined one-third by teacher-made objective 
tests and the ones mentioned above. The latter actually contributed about 
one-twelfth of the criterion. 

McKee (372) equated two groups of 50 students for intelligence in 
elementary and advanced freshman English and compared their semester 
grades, determined for the most part by a subjective estimate of written 
work, with a prognosis made first by an objective test and with a second 
prognosis made by the use of a theme. Both the test and theme were written 
at the beginning of the semester. He found the objective test to be a far 
more reliable prognosticator of superior students in freshman English 
than the written theme. 

Odell (376), in an extensive study involving 23,500 pupil answers to 
thought questions scored by 57 raters, concluded that the reliability of 
ratings given with scales was not significantly higher than that given 
without scales. He further found that with raters having no teaching experience 
the scales tended to assist in fixing the general test standards but 
they did not increase the reliability of the actual rating of single answers. 
This finding is similar to that of Ruch and Stoddard (380) but opposite to 
that of Hudelson (368). 

Monroe and Souders (375) compared two sets of examination papers 
prepared by pairs of teachers working together and again working indi- 
vidually both in the construction and the grading of final examinations. 
This work, carried on by 1,736 high-school teachers, showed that while 
in the best standardized tests both constant and variable errors are dis- 
tinctly less than corresponding errors in examination grades, nevertheless 
written examinations may be so improved that differences in accuracy of 
examination grades and scores may not only be lessened but may yield 
about the reliability of .65 as found in many widely used standardized 
tests. They further believe that their results contradict those of Elliott and 
Starch because the latter failed to differentiate between variable errors 
(those errors due to different educational aims, different standards of 
excellence, etc.) and constant errors (those due to no distinction between 
scores and grades). 

Cochran and Weidemann (361), by setting up standard procedures for 
grading “explain” and “discuss” essay test questions, found consistency 
coefficients ranging from .78 to .98 between experienced scorers in history. 
Thirty-five teachers with experience in several high-school subjects and 
with only ten-minute training in the standard procedures of scoring were 
able to produce an average consistency coefficient between their first and 
second scoring of .78 and an average objectivity coefficient, based on 237 
correlations, of .56. 
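
The two coefficients reported here can be illustrated with a short sketch. It assumes, as the wording suggests, that a consistency coefficient is the correlation between one scorer's first and second scoring of the same papers, and that an objectivity coefficient is the average correlation between the marks of different scorers. The function names and all marks below are hypothetical; they are not data from Cochran and Weidemann.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def consistency(first_scoring, second_scoring):
    """Assumed definition: one scorer's first marks correlated with the second."""
    return pearson_r(first_scoring, second_scoring)


def objectivity(scores_by_rater):
    """Assumed definition: mean correlation over every pair of different scorers."""
    pairs = list(combinations(scores_by_rater, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Hypothetical marks given to the same six papers.
rater_a_first = [12, 15, 9, 18, 14, 11]
rater_a_second = [13, 14, 10, 18, 13, 12]
rater_b = [10, 16, 8, 17, 15, 9]
rater_c = [14, 13, 11, 19, 12, 12]

print(round(consistency(rater_a_first, rater_a_second), 2))
print(round(objectivity([rater_a_first, rater_b, rater_c]), 2))
```

With k scorers the objectivity figure averages k(k - 1)/2 pairwise correlations, which is why such a coefficient is typically reported together with the number of correlations on which it rests.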

Leighton (371) criticized the work of Ruch, Elliott, and Starch in two 
ways. First, the experiment with a single paper and several judges giving 
a numerical grade is artificial. He suggested that a more significant 
experiment would be the ratings of a group of papers using the usual 
comparative value grading rather than the percent technic. Second, there 
is no evidence that a definite criterion for judgment was offered for the 
essay test, though such a criterion is deemed extremely important by these 
men when building an objective test. 

Kinney and Eurich (369) summarized the research on comparison of 
different types of tests, saying that there are few conclusive results. They 
pointed out a double need: first, for experiments more coordinated and of 
a wider scope than the usual “controlled experiment”; and second, for 
pooling administrative experiences with regard to test construction 
and use. 


The Measurement of Mental Functions 


Corey (362) gave an essay test consisting of 6 questions, and an objec- 
tive test consisting of 96 multiple-choice questions and 13 matching 
questions to 102 students in educational psychology. The correlation 
between the essay and the objective tests when corrected for attenuation 
was .93, thus indicating that the two tests measured the same function. 
The results of the Army Alpha test for the above group, when correlated 
with the essay test and corrected for attenuation, yielded a coefficient of 
.39; but when the Army Alpha was compared with the objective test and 
also corrected for attenuation the coefficient was .62. Corey concluded 
that when the essay and objective tests cover the same subject the latter 
is more closely related to the Army Alpha than is the essay test. The 
results suggest the possibility that other factors of this study may require 
a higher degree of control. 
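
The correction for attenuation referred to above is usually computed with Spearman's formula, in which the observed correlation between two tests is divided by the geometric mean of their reliabilities. The sketch below assumes that formula was the one applied; the function name and the numerical values are hypothetical, since the reliabilities Corey used are not reported here.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Spearman's correction: estimated correlation between true scores.

    r_xy is the observed correlation between tests X and Y;
    r_xx and r_yy are the reliability coefficients of X and Y.
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / sqrt(r_xx * r_yy)


# Hypothetical values, not taken from Corey's study.
observed = 0.60
essay_reliability = 0.64
objective_reliability = 0.81

print(round(correct_for_attenuation(observed, essay_reliability, objective_reliability), 3))
```

Because the reliabilities are themselves estimates, a corrected coefficient can exceed 1.00 when they are underestimated, so values near or above unity are better read as "approximately the same function" than as exact figures.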

Paterson (378) gave a one-hour essay test and a one-hour objective test 
to students in his five-hour two-quarter course over a period of two years. 
He found the correlation between the old and new type tests to be .52. 
As the correlation between these two tests was as high (.52) as the 
validity of the lowest (essay) test, he concluded that the two examinations 
were measuring the same thing. It should be noted that the studies by 
Corey and Paterson compared unimproved essay tests with tests consisting 
of two or more forms of improved objective tests. 

Sims (382) gave thirty-three students an essay examination of six recall 
and four discussion questions. Eight readers scored the recall questions with 
a key and scored the discussion questions by sorting them into normal 
distribution groups. All papers in each section of the distribution were 
marked with a predetermined score. The variation among the readers was 
slight, the objectivity of the examination was .97, and the reliability of 
the examination was .84. The recall questions and the discussion questions 
did not seem to measure the same thing as the correlation between them 
when corrected for attenuation was only .53. When compared with an 
objective test taken by the same group the recall questions yielded a 
correlation of 1.01 when corrected for attenuation, while the discussion 
questions yielded a coefficient of .64 when similarly corrected, thus indi- 
cating that the latter type of question measured something quite different 
from that measured by the objective test. 

Cochran and Weidemann (360) conducted an experiment consisting of 
4 to 8 improved essay questions making up an “explain” essay test, and a 
second test consisting of from 50 to 100 simple fact answer questions over 
the same material covered by the essay test. A set of 4 tests was constructed 
to cover the work of one semester of American history. The tests were 
given every 8 weeks over a period of four consecutive semesters. The 
improved essay test was given first and the simple fact answer test was 
always given on the following day. The median consistency coefficient for 
the 4 simple fact answer tests was .87; the median consistency for the 4 
“explain” essay tests was .55; while the median community of function 
value was .59 with a range from .54 to .82. As a result of the use of supple- 
mentary statistical technics it appeared that the overlapping of mental 
functions measured by the “explain” essay and the simple fact answer 
tests under actual classroom conditions was about 60 percent; that about 
40 percent of the mental functions measured by the former were not 
measured by the latter; and 40 percent of the mental functions measured 
by the latter were not measured by the former. Similar experiments with 
the “discuss” essay and simple word answer fact tests were conducted by 
the same authors (359). The results resembled those for the “explain” 
essay and simple word answer tests. 

Weidemann and Newens (388) experimented with the “compare and 
contrast” essay and true-false tests covering the same material. Each of 
the seven units of test materials was based upon the content of the course 
during the two weeks immediately preceding the administration of the 
given test unit. The consistency of the essay was found to be as high as 
that of the true-false test. Under classroom conditions the two tests over- 
lapped in mental functions to the extent of from 50 to 70 percent in terms 
of median values. Approximately 30 to 40 percent of the mental functions 
measured by the “compare and contrast” essay tests were like 30 to 40 
percent of mental functions measured by the true-false tests. The evidence 
on overlapping of mental functions measured by essay and objective tests 
is insufficient to warrant any definite conclusion. 

The foregoing studies raise the question of what is the nature of the 
mental functions which such tests measure. 


Student Preparation for and Reaction to Testing Programs 

Crawford and Raynaldo (364) made twenty comparisons in fourteen 
different classes. One-half of the students in each class were required to 
take notes on the lecture while the other half just listened. The next day the 
groups were reversed. True-false and essay tests were given after each of 
four such rotations. Those who took notes were inferior as measured by the 
true-false test but superior as measured by the traditional essay test. The 
authors suggested that the true-false test emphasizes unorganized use of 
fact material. However, it must be noted that student note-taking does not 
necessitate organization of material. 

From a questionnaire, Douglass and Tallmadge (365) found that 
students preparing for an essay examination said that they read and 
reviewed generalities and trends, attempted to draw several important 
conclusions from tables, formulated personal opinions, and read notes 
on the text and lectures carefully but without picking out details to be 
memorized. Students stated that when they prepared for objective tests 
they learned tables and minute details of materials covered and tried to 
remember the exact words of the book and other specific points. The proce- 
dure of learning was given by about an equal number of students in study- 
ing either for the essay or objective test. 

Meyer (373), upon examination of studies by Terry (385), Douglass 
and Tallmadge (365), and Crawford (363), concluded that essay exami- 
nations result in a superior type of preparation, more adequate learn- 
ing, and more recall than the true-false, multiple-choice, or completion 
types of examination. For the most part, these results are based upon 
questionnaires. 

Meyer (374) divided 124 students into four groups, each studying for 
three two-hour periods on Civil War history. One group was told to 
study for a true-false test; a second group studied for a multiple-choice 
test; a third group studied for a completion test; and a fourth group 
studied for an essay test. At the end of the study periods all groups were 
given all four types of test. Five weeks later the same testing procedure 
was again followed. Meyer concluded that the examination set which the 
individual has is of fundamental importance in learning and retaining 
sense material. The essay examination set should be used in preference to 
any objective examination set if the student is to recall material in an 
organized fashion and to know facts when cues are not given. 

In the field of student opinion Klise (370) found that of 2,065 students 
at Iowa State College 79 percent preferred the true-false, multiple-choice, 
and completion combination of objective tests; 18 percent preferred the 
essay test; and 3 percent were undecided. 

Hastings (367), in experimenting with examinations constructed by 
students in college English, concluded that such procedure leads to student 
review work, increases cooperation between instructor and student, and 
results in an examination which is satisfactory to the instructor. 


Summary 


Research with the essay test has, in the main, been conducted with the 
traditional or unimproved form and comparisons have been made with 
improved forms of objective tests. Such comparisons indicate that the 
improved objective test measures as well as or better than the unimproved 
essay test. When analysis of essay tests into defined types such as list, 
outline, describe, compare, contrast, explain, discuss, develop, evaluate, 
and summarize is achieved for instructional use, comparative studies using 
improved essay types and improved objective types of tests will become 
possible. 

The use of both objective and essay tests seems to be a better basis to 
evaluate the achievement of such varied results as specific, general, and 
organized information. 

The foregoing studies give a slight suggestion of two developing 
tendencies: 


1. That objective tests are used to measure certain phases of achievement in such 
subjects as mathematics, physics, and biology; while both objective tests and essay 
tests are useful in measuring certain phases of achievement in such subjects as 
history, social sciences, and philosophy. 


2. That objective tests are useful on the elementary-school level; while both 
objective and essay tests are useful on the college level. 


CHAPTER V 


Achievement Tests in Colleges and Universities 


Factors Determining Significant Characteristics of Educational 
Tests 


THE CHARACTERISTICS significant for an achievement test are determined 
in any particular case by the uses to which the test is to be put and the 
effects resulting from such uses. Obviously, tests should be so constructed 
and used as to promote rather than hinder important educational values. 
Hence, studies of the uses and effects of tests are essential to the sound 
development of educational testing. 

In the college testing field there have been few carefully conducted 
studies demonstrating potential uses of achievement tests. Investigations of 
desirable and undesirable effects of tests are still more limited. Most 
publications on this topic are arguments for or against various uses, or else 
present very incomplete and inconclusive evidence of values promoted or 
hindered by educational testing. 


Uses of Achievement Tests in Colleges 


In the selection and guidance of college students, achievement tests as 
well as aptitude tests are frequently used. Nevertheless, few studies have 
been made to evaluate this practice. The Committee on Personnel Methods 
of the American Council on Education (390), after reviewing studies 
showing the unreliability of teachers’ marks, of Regents’ Examinations, 
and of College Entrance Board Examinations, proposed to develop many 
forms of comparable tests in the various subject fields on the ground that 
these tests would prove more effective bases for selection and guidance of 
students. No evidence for this claim was presented. The Study of the 
Relation of Secondary and Higher Education in Pennsylvania (425) 
demonstrated wide variability in the functions tested among students 
within the same college, among the average scores of the various colleges, 
and among various prevocational groups within the colleges. This investi- 
gation suggested but did not establish the possible effectiveness of achieve- 
ment tests in educational guidance. The annual reports of the American 
Council’s Committee on Educational Testing (418, 419) gave similar 
data on variability and also suggested the use of tests in guidance. B. D. 
Wood (459) and E. P. Wood (460) maintained that the major function 
of testing is guidance and that this function was best met by cumulative 
records of continued testing. Crawford (402), however, presented data 
from his work with Yale students which clearly questioned the effective- 
ness of present achievement tests in the guidance of students. He also 
doubted the comparability of most tests. 

Wagner and Strabel (456), in studying the predictability of foreign 
language marks at the University of Buffalo, found that marks in this 
field could be more exactly predicted than in any other field at that 
University. The average of the Regents’ grades in language gave the 
highest predictive index, whereas the cooperative French test did not 
foretell success as well as high-school marks. Reeder (438) found a low 
relation between placement tests and success. Since the criteria of success 
were the marks given by college instructors, the poor showing of certain 
tests in prediction may be a fault of the marks rather than the tests. 
Palmer (436), on the other hand, found a correlation of .70 between pre-study 
and post-study scores when using the cooperative physics test. He 
suggested that such test scores made possible great improvements in guidance. 
In support of this position, Terry (447) found that the Van Wagenen 
Reading Test administered at the beginning of a course in educational 
psychology gave a correlation of .72 with the final course examinations. 

With reference to the use of achievement tests in the placement of college 
students in the foreign languages, Gausewitz (408), Henmon (414), and 
R. E. Monroe (432) collected evidence indicating the value of this procedure. 
Brigham (397) reported the new College Entrance Examination 
Board plan for combining entrance testing with placement testing. He 
presented data indicating the greatly increased reliability of the College 
Board Examinations in English. Fletcher (407) presented no data but 
described the use of examinations in giving college credit for post- 
graduate work in high school and for out-of-school educational progress. 
Stalnaker and Richardson (446) defined carefully the criteria demanded 
of tests used for giving scholarships to college students. 

In determining the characteristics of college students, Betts (393) used 
a test on contemporary affairs which he constructed and administered to 
graduate students in education at Northwestern University. Ellis (405) used 
the Iowa High School Academic Contest Test in World History with 
University of Missouri freshmen to discover their achievement in com- 
parison with high-school students. 

Jones (420) found sixty-six colleges using a comprehensive examina- 
tion for the degree in two or more departments. Eighty-five colleges were 
using this type of examination in at least one department. From this investi- 
gation, he contended that the major function of the comprehensive exami- 
nation is to test the student’s facility in bringing to bear upon an exercise 
ideas and facts from several subject fields. 

The most frequently reported use of achievement tests in colleges is in 
studying the results of various types of educational experiences, procedures, 
or materials. Haggerty (409), in the North Central Association’s study 
of standards for higher institutions, used the results of tests as one index 
of the quality of the college. Cheydleur (401) employed foreign lan- 
guage tests to evaluate the success of an experiment in teaching French to 
adults. H. T. Tyler (452) determined the effectiveness of a remedial reading 
course by using the Nelson-Denny and the Iowa reading tests. D. F. 
Miller (431), and Price and J. A. Miller (437), by the use of tests, deter- 
mined the effectiveness of certain methods of segregating students in 
zoology courses at Ohio State University. Meyer (430) appraised remedial 
procedures in zoology by similar tests. Bowden (396) employed the 
Allport-Vernon Test of Social Values in judging the effectiveness of a 
course in social psychology. Sinclair and Tolman (441) employed tests 
to estimate the relative effects upon open-mindedness of an arts college 
course and a technical course. Hartmann and Barrick (411) attempted to 
use the Carnegie General Culture Test to determine the effect of college 
education upon general cultural information. Welborn (457) tried to use 
tests to determine the nature of logical learning in college classes. 

As a means of measuring teaching efficiency both Barr (391) and Hart- 
mann (412) utilized tests. Both presented evidence of the complexity and 
difficulty of the task. 

The studies during the past three years have extended the uses of 
achievement tests in college. Formerly, their uses were usually restricted 
to marking students. More and more they have been used in selecting and 
guiding students, in predicting college success, in placing students within 
courses and within sections of the same course, in assigning advanced 
college credit on the basis of examination, in awarding scholarships, in 
studying the characteristics of college students, in granting diplomas or 
honors, in studying the effectiveness of educational procedures, and in 
evaluating teaching ability. The variety of uses is a chief reason for the 
barrage of criticism which has been directed against many of the tests 
constructed during the past ten years. The commonly accepted technics 
of test construction do not produce tests appropriate for such varied 
usage. 


Values Promoted and Hindered by Testing 


The effects of testing are undoubtedly significant and varied. They are 
probably dependent upon the nature of the test, the procedures employed 
in giving the test, the use to which the test results are put, and the traditions 
of the college as well as the attitude of the individual student. If testing 
is a potential power for educational good or evil, it is most unfortunate 
that we do not have thorough studies of the effects of testing under varying 
conditions. Most publications dealing with this topic present points of view 
rather than data. 

Douglass (403, 404) pointed out what he believed to be crucial dangers 
in the testing movement, namely, the narrowing of educational purposes, 
materials, and procedures because of the failure of tests to cover the 
broader outcomes of education. He emphasized, as another danger, the 
stimulation of students to frantic but not well-directed study. Similarly, 
Krey (423) believed that testing which did not closely parallel the objectives 
of instruction was potentially harmful. W. S. Monroe (433) maintained 
that test makers had now become the real curriculum makers 
because of the effect of tests in directing the efforts of pupil and teacher. 

Counter claims were also advanced. Hanford (410) reported that the 
comprehensive examinations at Harvard had a beneficial influence on 
instruction by emphasizing the interrelation of courses. He also maintained 
that the examinations stimulated students toward definite and desirable 
goals. Boucher (395) found that the comprehensive examination 
program at the University of Chicago had forced instructors to formulate 
their objectives and clarify their purposes. Lowell (429) considered tests 
to have important potentialities as teaching devices. Kulp (424), using 
weekly tests with graduate students, found an increase in the amount of 
learning under these conditions. Keeler (421) maintained that educational 
measurement, when it goes hand in hand with instruction, stimulates both 
teachers and students to better efforts. Theisen (449) stated that the effects 
of measurement thus far have been to turn educational efforts toward 
reconstruction of curriculums and procedures rather than to continue the status 
quo. Barr (391) claimed that continued educational measurement is 
necessary for the improvement of education and that testing is not in 
itself harmful. Evils come from poor tests or their unintelligent use. 

Lindquist and Anderson (427) attempted to meet the criticism of test- 
ing by distinguishing two types of tests, the general achievement test and 
the diagnostic test. In discussing the general achievement test, they advo- 
cated restricting the test to what has been learned. They would include a 
large proportion of items which test only for understanding at a low level 
and they oppose measurement of desirable emotionalized attitudes. They 
believe that the dangers inherent in such narrow testing may be avoided 
by careful administrative procedures. Richardson and Stalnaker (439) 
claimed that an achievement test need have no pedagogic value and ought 
not to measure affective qualities. Wilson (458), on the contrary, recog-
nized the influence of testing and advocated a plan of administration by
the teachers themselves of all tests which affect teaching. Woody (461) also 
admitted the possibility of evil effects from testing but maintained that the 
danger represented a challenge to improve the testing movement. 

This brief summary of conflicting opinions shows clearly the need for 
thorough investigations of effects of various types of tests under varying 
conditions. If testing may be a power for both good and evil, we need to 
know more about the values which may be promoted and hindered by 
testing and the characteristics of tests which will facilitate rather than 
hinder desirable learning. 


Philosophy of College Testing 


The major change in the field of measurement in recent years has been 
the development of a broad philosophy of evaluation. Previously, the 
important issue in test construction was the relative merits of the various 
forms of objective test devices. Should one use a true-false test or a
multiple-choice test, a multiple-choice test or a matching test, a completion
test or a true-false test? Attention was directed to the type of test device 
which could easily be used for particular kinds of subjectmatter. At the 
present time, attention is focused on the kinds of evidence which indicate 
the attainment of various important outcomes of teaching. College courses 
are offered in order to facilitate certain desirable changes in students. 
An achievement test is a means of discovering the degree to which these 
desired changes in students are taking place. The nature of the test will 
vary with the nature of the changes sought. Evidence of the recall of 
information is different from evidence of the use of information. Evidence 
of enjoyment of literature is different from evidence of the ability to judge 
literature. As test makers have come to recognize that they are seeking 
various kinds of evidence, they have come to realize that varied test 
procedures are necessary for collecting the evidence. Educational pur- 
poses or objectives have become increasingly the basis upon which tests 
are constructed. Hence, a first problem of the test maker is to formulate 
the teaching purposes. 


Formulating the Objectives 


Although objectives constitute a curriculum problem, they are also basic 
in developing examinations. When objectives have been formulated in 
designing a curriculum, they are frequently neglected by the test maker. 
Examinations are devised without reference to the purposes of the course or 
of the college. Boucher (394) pointed out that not only the course content 
and methods of instruction but also the evaluating instruments composing 
comprehensive examinations must be determined by the objectives set up 
for courses. 

Johnson (417) stated that the important preliminary step in construct- 
ing tests in science courses is to formulate the objectives of the courses. 
According to Kelley and Krey (422), any examination designed to test 
learning in the social sciences must consider the objectives of teaching in 
this field. Uhl (455) emphasized the importance of appraising heretofore 
neglected outcomes of teaching. The need for getting evidences of changes 
in character and attitudes as well as changes in intellectual accomplish- 
ment was pointed out by Robbins (440). 

Teachers often assume that if a student has acquired much of the infor- 
mation in the course, he has also reached other important objectives to an 
equal degree. Information tests are used as the only measure of achieve- 
ment. Eurich (406) gave evidence that in the achievement of objectives in 
freshman English many relationships are very low and none are high. 
Similar findings were obtained in other fields. Hence, in order to get a 
clear picture of the growth of students, it is important to get evidence of 
changes in the direction of each of the important objectives. 

In the absence of objectives based on curriculum studies, Tyler (453) 
suggested two methods useful in formulating the objectives. One method
was to consider the general purpose of a course and to analyze it into its
subfunctions. The analysis continues to the point where the objectives 
become clear and useful in teaching and appraisal. A second method was 
to consider the course content, the teaching and administrative procedures, 
and with reference to each topic and each procedure ask: Why is this a
part of the course? What change in student behavior is this expected to
bring about?


Clarifying the Objectives 


Usually after objectives for courses have been formulated, little has 
been done to clarify the objectives in terms of student behavior. Unfortu-
nately, the objectives remain vague and nebulous statements. They do not
describe the kind of behavior representing evidences of changes in the 
direction of the objectives. Hence, in constructing examinations, the objec- 
tives have often been passed by. Jumps have been made to a test exercise. 
Research workers in the testing field have frequently skipped the questions, 
“Does the kind of behavior required on this test express an important 
objective of the course? Do the kinds of behavior required in this test 
give evidence of all the important objectives of the course?” These are 
the significant problems when establishing the validity of tests. 

In the field of literature, Carroll (399) defined prose appreciation as 
the ability to distinguish between literary selections previously judged to 
be good or poor. This definition became the basis for his test. The signifi-
cance of definition is seen in the fact that the test requires of the student
kinds of behavior which are different from other kinds of behavior which
are often called appreciation. Noll (435) maintained that when students 
mark certain carefully chosen statements as true or false or doubtful, the 
students are exhibiting evidences of scientific thinking, consisting of habits 
of accuracy, suspended judgment, open-mindedness, intellectual honesty,
criticalness, and the habit of looking for true cause and effect relationships.
The question arises, Is the kind of behavior involved in identifying true and
false statements evidence of scientific thinking? Turnbull and Griffith
(451) reported some objectives of college home economics courses and the 
kind of behavior describing each of these objectives. 


Collecting Test Situations 


A third problem in preparing an examination is that of collecting the 
kinds of situations in which the behavior is expected to take place. This 
aspect is usually neglected. It has been tacitly assumed that a paper and 
pencil situation is the kind of situation in which the behavior is expected 
to take place. If we wish to find out how well a student uses arithmetic in 
situations which he meets in life, his behavior in those kinds of situations 
should be noted and recorded. If we wish to find out how skilful a student 
is in the chemistry laboratory, we should observe his behavior in the
laboratory as he works on a variety of problems. The examination would
consist of situations which would be representative of the variety of situa- 
tions in which the student has an opportunity to use arithmetic or to work 
in the laboratory. 

What are the conditions under which the behavior is expected? Are 
the students expected to remember the information pertinent in answering 
the questions in an examination or may they have the opportunity to con- 
sult sources of information? Stalnaker and Stalnaker (443, 444) pointed 
out the advantages of and results obtained in an open-book examination. 
There is need to investigate in similar fashion other important conditions 
characterizing test situations. 

A second aspect of the problem of collecting test situations is that of the 
size of the sample. As the sample of situations increases in size, the results 
obtained are more stable. The test is said to be more “reliable.” Lindquist 
and Cook (428) presented evidence showing that the reliability and 
validity coefficients of certain spelling tests are a function of the time 
limits in which the tests are administered. The time limits were based upon 
the percent of students finishing each test. This is a further illustration of 
the conditions of the test situation. How much time should be allowed 
for the test? Should the time limit be that necessary for 25 percent of the 
pupils to finish, or should it be long enough for all pupils to finish? 
These questions go back to the meaning of the objective. If the test is 
designed to collect indirect evidence, the optimum time limit is one which 
gives the best index of the direct evidence. 
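
The statement that a larger sample of situations gives more stable results
can be illustrated with the Spearman-Brown formula. The sketch below is
an assumption of this illustration, not a procedure taken from the studies
cited: it supposes that the added situations are comparable to the original
ones, and the starting reliability of .60 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened `length_factor` times.

    Spearman-Brown formula: r_k = k * r / (1 + (k - 1) * r).
    Assumes the added situations are comparable to the original ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical short test with a reliability of .60, then doubled and quadrupled.
for k in (1, 2, 4):
    print(k, round(spearman_brown(0.60, k), 3))
```

Under these assumptions the predicted reliability rises as the sample of
situations grows, but by a smaller amount with each addition.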


Recording the Behavior 


Two chief uses of a record of the student’s behavior are that the be- 
havior may be evaluated at a later time and independently by other in- 
dividuals. On a paper and pencil examination, the student makes his own 
record. It is often assumed that this is the only kind of record of changes 
in the behavior of students. Charters (400) reported the use of the anecdo- 
tal record at Rochester Athenaeum and Mechanics Institute. Significant 
observations of students were recorded. These descriptions of behavior 
could be evaluated independently by others at a later time. Hence, it is 
important in recording the behavior to separate the descriptions of the 
behavior and the evaluations. In an anecdotal record the evidence of be- 
havior is available. In a trait rating device, the evaluation only is re- 
corded. 

A record may also be the product resulting from student behavior. Stu- 
dents’ themes and art productions are evidence of changes in behavior. 
A frog dissected by a student is evidence of the student’s ability to dissect 
a frog. R. W. Tyler (453) reported the use of a checklist in recording the 
behavior of a student in using a microscope and in describing the student’s 
mount. Tharp (448) used a phonograph in recording students’ French 
pronunciations. The records were evaluated at later times by other individ-
uals. Slow motion pictures have been used as diagnostic instruments in
recording the behavior of athletes and getting recorded evidence of close 
decisions. Likewise motion pictures may be used in recording the behavior 
of a speaker. Photographs may be used as evidences of physical growth. 


Evaluating the Behavior 


Most standardized tests yield a score for a student in a subject. The 
interpretation of the result is in terms of the relative size of the score 
in the subject. Recently there has been a shift in interpretation, from 
achievement in each subject to achievement in each of the important ob-
jectives of a subject or in an important objective of several subjects. From
this angle the evaluation of a student’s achievement can be made in light
of the kind of behavior expressing each objective, and not only in terms 
of a score in a subject. 

The essay examination received much criticism due to the variability 
among readers in grading the same examination paper. In grading essay 
examinations, it is assumed that readers are evaluating the student’s re- 
actions on the same objective. This assumption is not always sound. For 
example, in grading a test on ability to use information in new situations, 
some readers forget the objective and grade the test on the amount of 
information the student has stated instead of the degree to which pertinent 
information is used in solving a new problem. Wide variations in the scores 
result. 

Trimble (450) studied the objectivity of oral examinations. The stu- 
dents’ oral responses were evaluated on eight traits by three judges. The 
interrelationship between the ratings of the judges on the traits ranged 
from -.23 to .80. 

Another assumption involved in evaluating the behavior on an essay 
question is that the readers have the same standard of response in mind. 
This assumption can be checked in a meeting of the readers at which time 
the ideas which are expected in the responses are brought together and the 
numerical values allotted to each idea may be decided upon. Each reader 
may be given a copy of these specifications to follow in grading the essay 
questions. Stalnaker and Stalnaker (445) reported that when readers
“analyze the ideal answer,” and assign “a certain number of points to each 
significant part of it,” the readers agree more closely. Agreement in evaluat- 
ing English examinations at the University of Chicago increased from .42 
to .92. Horney (415) found close agreement in scoring a test of understand- 
ing chemical principles when the responses were evaluated with a definite 
type of behavior as the standard. 
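
The practice of allotting points to each expected idea may be sketched as
follows. The ideas, the point values, and the names RUBRIC and
score_answer are hypothetical; they illustrate the general procedure and
are not the specifications used in the studies cited.

```python
# Hypothetical specification agreed upon by the readers: each expected idea
# and the number of points allotted to it.
RUBRIC = {
    "states the principle": 4,
    "applies it to the new case": 4,
    "draws a justified conclusion": 2,
}


def score_answer(ideas_found, rubric=RUBRIC):
    """Sum the points for the specified ideas a reader found in the answer."""
    unknown = set(ideas_found) - set(rubric)
    if unknown:
        raise ValueError(f"ideas not in the specification: {sorted(unknown)}")
    # A set is used so that an idea noted twice is credited only once.
    return sum(rubric[idea] for idea in set(ideas_found))


# Two readers who find the same ideas in a paper must reach the same score.
reader_a = score_answer(["states the principle", "applies it to the new case"])
reader_b = score_answer(["applies it to the new case", "states the principle"])
print(reader_a, reader_b)
```

Under such a scheme the readers can differ only in which ideas they judge
to be present, not in the weight each idea carries.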

A prevalent misconception of an objective test was clearly discussed 
by Brownell (398). The misconception assumes that an objective test lacks 
subjectivity and hence is more satisfactory than an essay test. Brownell 
points out many instances where subjectivity and judgment enter into the 
construction, administration, and use of objective tests. Objective tests are
objective in that competent individuals agree in the scoring. Objectivity
is an important characteristic of evidence. However, it should not be pur- 
chased at the expense of the primary characteristic, validity. This leads us 
to another major problem in preparing examinations. 


Practicability of Examinations 


The ease with which evidence of the achievement of an objective of teach- 
ing can be collected and evaluated is an important consideration in an 
evaluation program. A test which is easily administered and quickly scored 
is more likely to be used than one in which the evidence is difficult to col- 
lect and time-consuming to score. There is much evidence to show that 
important outcomes of teaching are not evaluated because practicable 
methods are not available for collecting evidence of progress. 

Practicability is so much desired that teachers have eagerly accepted
easily administered and easily scored paper and pencil examinations and
have assumed that they were valid. They have not asked about each test,
Does it require of students the kinds of behavior which represent the ob- 
jectives of my course, or does it give a satisfactory index of this behavior? 

Index devices make an evaluation program more practicable. Although 
they do not yield direct evidence of desired behavior, they are valuable 
when they give a satisfactory index of the direct evidence. In college chemis- 
try courses, an important objective is the ability to apply chemical prin- 
ciples. Hendricks, R. W. Tyler, and Frutchey (413) reported research in 
checking an index device with direct evidence of this outcome. An objec- 
tive of French instruction is the ability to pronounce French words. Direct 
evidence of this ability may be collected about each student as he reads 
and pronounces French words. It is a time-consuming job to obtain this 
kind of evidence for each student and have it evaluated by two or more 
trained judges. Tharp (448) experimented with a paper and pencil index 
device which can be administered in class in five minutes and scored in less 
time. He made phonographic recordings of students’ French pronuncia- 
tions and had the records played and the pronunciations judged by three 
French teachers. The students also took the index examination. The corre- 
lation between the average ratings of the three French teachers and scores 
on the index examination was .84. If this relationship continues to hold 
from time to time with other groups of students and teachers, a practicable 
device is available for collecting evidence of the ability to pronounce 
French. As W. S. Monroe (434) pointed out, it is misleading to assume
that the relationship between direct and indirect tests is fixed. The assump- 
tion should be checked frequently. 
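
The repeated check of an index device against direct evidence may be
sketched as below. The scores are invented for illustration, and the
minimum acceptable correlation of .80 is an assumption of this sketch, not
a figure proposed by the authors cited.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def index_still_holds(direct, index, minimum_r=0.80):
    """True when the index scores still track the direct evidence."""
    return pearson_r(direct, index) >= minimum_r


# Hypothetical judges' ratings (direct evidence) and index scores, two groups.
judged = [2, 3, 5, 6, 8, 9]
index_scores = [10, 14, 19, 22, 30, 33]        # first group: index tracks ratings
later_index_scores = [20, 12, 25, 15, 18, 22]  # later group: relation has weakened

print(index_still_holds(judged, index_scores))
print(index_still_holds(judged, later_index_scores))
```

When the check fails for a new group, the index device should not be
taken as evidence of the ability until the relationship is reestablished.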


Item Validity 


Several methods of item validation have been used in selecting items 
for a test. Kelley and Krey (422) proposed that a good item for inclusion 
in a test is one whose “difficulty” decreases as the school grade increases.
Kelley pointed out that “items selected in this manner can be built into a
final test which does consistently measure something, but whether it is
what the test label calls for remains a matter of judgment.” 
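
The grade-progress criterion can be sketched in a few lines. Here
difficulty is taken to be the proportion of pupils failing the item, which is
an assumption of this sketch; the grades and responses are hypothetical.

```python
def difficulty(responses):
    """Proportion of pupils failing the item (responses are True or False)."""
    if not responses:
        raise ValueError("no responses for this grade")
    return sum(1 for correct in responses if not correct) / len(responses)


def difficulty_falls_with_grade(responses_by_grade):
    """True when difficulty strictly decreases from each grade to the next."""
    grades = sorted(responses_by_grade)
    levels = [difficulty(responses_by_grade[g]) for g in grades]
    return all(later < earlier for earlier, later in zip(levels, levels[1:]))


# Hypothetical responses to one item in grades 7, 8, and 9.
item = {
    7: [False, False, True, False],  # difficulty 0.75
    8: [True, False, True, False],   # difficulty 0.50
    9: [True, True, True, False],    # difficulty 0.25
}
print(difficulty_falls_with_grade(item))
```

As the quotation from Kelley makes plain, an item that passes this check
measures something consistently; the check itself does not show what.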

Another method used in item validation is the degree to which an item 
discriminates between the better and poorer students. Differences have 
arisen in choosing an appropriate criterion for selecting the better and 
poorer students. In some cases it is assumed that the total score on the 
general achievement scale is the valid criterion. According to Lindquist 
and Anderson (427), “an item may be said to be perfect in discriminating
power when every pupil who responds correctly to the item ranks higher 
on the general achievement scale than any pupil who fails on the item.” 
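
The definition quoted above translates directly into a comparison of the
lowest-ranking pupil who passes the item with the highest-ranking pupil
who fails it. The function name and the scores below are hypothetical, and
rejecting an item that every pupil passes or every pupil fails is a choice
made for this sketch.

```python
def perfectly_discriminating(item_correct, total_scores):
    """True when every pupil passing the item outranks every pupil failing it."""
    if len(item_correct) != len(total_scores):
        raise ValueError("one response and one total score are needed per pupil")
    passed = [s for ok, s in zip(item_correct, total_scores) if ok]
    failed = [s for ok, s in zip(item_correct, total_scores) if not ok]
    if not passed or not failed:
        # The definition is vacuous here; this sketch treats it as an error.
        raise ValueError("item must be passed by some pupils and failed by others")
    return min(passed) > max(failed)


# Hypothetical general achievement scores for four pupils.
scores = [40, 35, 30, 20]
print(perfectly_discriminating([True, True, False, False], scores))
print(perfectly_discriminating([True, False, True, False], scores))
```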

This method of statistical validation of test items has been used to pick 
out and eliminate from a group of test items those of low “validity” so 
that the test consisting of the remaining items will have higher “validity.” 
Smith (442) proposed to test this hypothesis and concluded from his evi-
dence that “so far as obtaining more and more valid instruments of meas-
urement is concerned, statistical evaluation of individual items apparently
has little to contribute.”

The purpose of these methods of statistical validation is to construct 
a test which is consistent with respect to an internal criterion. When the 
criterion is the total score on the test, it is assumed that the test itself is 
valid. The question then arises, What makes the test itself valid? It ap- 
pears that the answer to that question goes back to the meaning of the 
objective—the behavior expressing the objective, the situations in which 
the behavior may occur, and the evaluation of the behavior. The essential 
characteristic of validity is the kind of evidence which expresses the pur- 
pose of the evaluation. 

Statistical methods are useful in making a test internally consistent or 
homogeneous. The process selects items which are highly related to a whole 
group. The kinds of items selected by the procedure are probably those 
kinds which are greatest in number in the total test and which have been 
given the most weight in the scoring. For example, in a test on educational 
statistics, including many items on computation and a few on interpreta- 
tion, statistical validation of the items would discard the items on inter- 
pretation and tend to make the test internally consistent in respect to com- 
putation. Some of the computation items may be discarded also. Statis- 
tical validation in this sense destroys the value of the test for the purposes 
for which it was constructed. The process eliminates opportunities to get
evidence of the ability of the students to interpret the use of statistical 
measures and some important kinds of behavior in computation. 
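
The educational statistics example can be made concrete with a small
sketch. The statistic used here, the difference between the proportions
passing an item in the upper and lower halves on total score, is one common
discrimination index chosen for illustration; the passage does not say which
statistic is meant. The six pupils, the five items, and the cutoff of .50 are
hypothetical.

```python
def discrimination_index(item, responses):
    """Proportion passing in the upper half minus that in the lower half.

    Halves are formed on the total test score, the internal criterion.
    """
    ranked = sorted(responses, key=lambda pupil: sum(pupil.values()), reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = sum(pupil[item] for pupil in upper) / half
    p_lower = sum(pupil[item] for pupil in lower) / half
    return p_upper - p_lower


ITEMS = ["comp1", "comp2", "comp3", "comp4", "interp1"]
ROWS = [  # 1 = correct, 0 = incorrect; four computation items, one interpretation
    (1, 1, 1, 1, 0),
    (1, 1, 1, 1, 1),
    (1, 1, 1, 0, 0),
    (1, 0, 0, 0, 1),
    (1, 0, 0, 0, 0),
    (0, 0, 0, 0, 1),
]
responses = [dict(zip(ITEMS, row)) for row in ROWS]

# Keep only the items that meet the hypothetical cutoff of .50.
kept = [item for item in ITEMS if discrimination_index(item, responses) >= 0.5]
print(kept)
```

In this invented case the interpretation item and one computation item
are discarded, and the test that remains measures computation alone.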

Statistical methods are also used in selecting items by checking each 
item with an outside criterion. Horst (416) reported their use in develop- 
ing a test for selecting salesmen. The external criterion was success on 
the job. The purpose was to build a test which ranked the employees in 
the same order as did the measure of success of one’s work on the job.
From an original group of tests given the employees, items were selected
which correlated highly with their success on the job and correlated but 
slightly with each other. In this way a new test was built which had a very 
high predictive value of success on the job. The validity of the new test
rests upon the degree to which it is a good index of the valid kind of be- 
havior—that which was used as evidence of success on the job. 


Research Considerations 


Lehman and Witty (426) gave seven major objectives of an introductory 
course in psychology and pointed out seven obstacles involved in formu- 
lating objectives. These difficulties are founded on certain assumptions 
which often are not necessary to make. For example, a difficulty is “the 
disparity among the objectives that have been published.” The assumption 
underlying this obstacle is that all introductory courses in psychology 
should have the same objectives. Fortunately, it is not necessary to make 
this assumption in setting up objectives. Teachers or departments may 
formulate the purposes for their own courses. Another difficulty, “our 
limited knowledge of the psychology of learning,” is based on the assump-
tion that we must know how learning takes place before we can find evi-
dences that it has taken place. This assumption also is not necessary. If
we can detect evidences of changes, it is not necessary to know how the 
changes have occurred, for the purposes of formulating objectives. Re- 
search in improving examinations must consider the assumptions under- 
lying an appraisal program. 

The development of testing technics likewise involves a series of as- 
sumptions. R. W. Tyler (453) pointed out some of these assumptions and 
suggested procedures for checking them. “If an appraisal program is to 
develop a sound body of philosophy, techniques, data, inferences, and prin- 
ciples, it is essential that these assumptions be tested one by one, so that 
we may be able to separate findings and methods which are based upon 
valid assumptions from those which are untenable because the assump- 
tions underlying them are not true. This is a task which challenges all who 
work in the field of achievement test construction.”